The precision matrix is of significant importance in a wide range of applications in
multivariate analysis. This paper considers adaptive minimax estimation of sparse
precision matrices in the high dimensional setting. Optimal rates of convergence are
established for a range of matrix norm losses. A fully data driven estimator based
on adaptive constrained $\ell_1$ minimization is proposed and its rate of convergence is
obtained over a collection of parameter spaces. The estimator, called ACLIME, is easy
to implement and performs well numerically.

A major step in establishing the minimax rate of convergence is the derivation of
a rate-sharp lower bound. A "two-directional" lower bound technique is applied to
obtain the minimax lower bound. The upper and lower bounds together yield the
optimal rates of convergence for sparse precision matrix estimation and show that the
ACLIME estimator is adaptively minimax rate optimal for a collection of parameter
spaces and a range of matrix norm losses simultaneously.

1 Introduction

The precision matrix plays a fundamental role in many high-dimensional inference problems.
For example, knowledge of the precision matrix is crucial for classification and discriminant
analyses. Furthermore, the precision matrix is critically useful for a broad range of applications
such as portfolio optimization, speech recognition, and genomics. See, for example, Lauritzen
(1996), Yuan and Lin (2007), Saon and Chien (2011). The precision matrix is also closely
connected to graphical models, which are a powerful tool for modeling the relationships
among a large number of random variables in a complex system and are used in a wide array
of scientific applications. It is well known that recovering the structure of an undirected
Gaussian graph is equivalent to recovering the support of the precision matrix. See,
for example, Lauritzen (1996), Meinshausen and Bühlmann (2006) and Cai, Liu and Luo
(2011). Liu, Lafferty and Wasserman (2009) extended the result to a more general class of
distributions called nonparanormal distributions.

The problem of estimating a large precision matrix and recovering its support has drawn
considerable recent attention and a number of methods have been introduced. Meinshausen
and Bühlmann (2006) proposed a neighborhood selection method for recovering the support
of a precision matrix. Penalized likelihood methods have also been introduced for estimating
sparse precision matrices. Yuan and Lin (2007) proposed an $\ell_1$ penalized normal likelihood
estimator and studied its theoretical properties. See also Friedman, Hastie and Tibshirani
(2008), d'Aspremont, Banerjee and El Ghaoui (2008), Rothman et al. (2008), Lam and Fan
(2009), and Ravikumar et al. (2011). Yuan (2010) applied the Dantzig Selector method to
estimate the precision matrix and gave the convergence rates for the estimator under the
matrix $\ell_1$ norm and spectral norm. Cai, Liu and Luo (2011) introduced an estimator called
CLIME using a constrained $\ell_1$ minimization approach and obtained the rates of convergence
for estimating the precision matrix under the spectral norm and Frobenius norm.

Although many methods have been proposed and various rates of convergence have been 
obtained, it is unclear which estimator is optimal for estimating a sparse precision matrix 
in terms of convergence rate. This is because the minimax rates of convergence,
which can serve as a fundamental benchmark for the evaluation of the performance of
different procedures, are still unknown. The goals of the present paper are to establish the
optimal minimax rates of convergence for estimating a sparse precision matrix under a 
class of matrix norm losses and to introduce a fully data driven adaptive estimator that is 
simultaneously rate optimal over a collection of parameter spaces for each loss in this class. 

Let $X_1, \ldots, X_n$ be a random sample from a $p$-variate distribution with a covariance
matrix $\Sigma = (\sigma_{ij})_{1 \le i,j \le p}$. The goal is to estimate the inverse of $\Sigma$, the precision matrix
$\Omega = (\omega_{ij})_{1 \le i,j \le p}$. It is well known that in the high-dimensional setting structural assumptions
are needed in order to consistently estimate the precision matrix. The class of sparse
precision matrices, where most of the entries in each row/column are zero or negligible, is
of particular importance as it is related to sparse graphs in the Gaussian case. For a matrix
$A$ and a number $1 \le w \le \infty$, the matrix $\ell_w$ norm is defined as $\|A\|_w = \sup_{|x|_w \le 1} |Ax|_w$.
In particular, the commonly used spectral norm is the matrix $\ell_2$ norm. For a symmetric
matrix $A$, it is known that the spectral norm $\|A\|_2$ is equal to the largest magnitude of
eigenvalues of $A$. The sparsity of a precision matrix can be modeled by the $\ell_q$ balls with
$0 \le q < 1$. More specifically, we define the parameter space $\mathcal{G}_q(c_{n,p}, M_{n,p})$ by

$$\mathcal{G}_q(c_{n,p}, M_{n,p}) = \left\{ \Omega = (\omega_{ij})_{1 \le i,j \le p} : \max_j \sum_{i=1}^p |\omega_{ij}|^q \le c_{n,p},\ \|\Omega\|_1 \le M_{n,p},\ \lambda_{\max}(\Omega)/\lambda_{\min}(\Omega) \le M_1,\ \Omega \succ 0 \right\}, \qquad (1)$$

where $0 \le q < 1$, $M_{n,p}$ and $c_{n,p}$ are positive and bounded away from 0, $M_1 > 0$ is a given
constant, $\lambda_{\max}(\Omega)$ and $\lambda_{\min}(\Omega)$ are the largest and smallest eigenvalues of $\Omega$ respectively,
and $c_1 n^{\beta} \le p \le \exp(\gamma n)$ for some constants $\beta > 1$, $c_1 > 0$ and $\gamma > 0$. The notation $A \succ 0$
means that $A$ is symmetric and positive definite. In the special case of $q = 0$, a matrix in
$\mathcal{G}_0(c_{n,p}, M_{n,p})$ has at most $c_{n,p}$ nonzero elements on each row/column.
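
To make the quantities in the parameter space concrete, the following minimal Python sketch evaluates the matrix $\ell_1$ norm (largest absolute column sum), the matrix $\ell_\infty$ norm (largest absolute row sum), and the column-wise $\ell_q$ sparsity measure $\max_j \sum_i |\omega_{ij}|^q$. The function names and the small tridiagonal matrix are hypothetical choices made for illustration only; they are not part of the paper.

```python
def matrix_l1_norm(a):
    """Matrix l_1 norm: the largest absolute column sum."""
    n_cols = len(a[0])
    return max(sum(abs(row[j]) for row in a) for j in range(n_cols))


def matrix_linf_norm(a):
    """Matrix l_infinity norm: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in a)


def max_column_lq(a, q):
    """max_j sum_i |a_ij|^q, with zero entries contributing 0 so that q = 0 counts nonzeros."""
    def term(x):
        if x == 0:
            return 0.0
        return abs(x) ** q
    n_cols = len(a[0])
    return max(sum(term(row[j]) for row in a) for j in range(n_cols))


# Hypothetical 3 x 3 tridiagonal precision matrix, used only for illustration.
omega = [[2.0, -1.0, 0.0],
         [-1.0, 2.0, -1.0],
         [0.0, -1.0, 2.0]]

print(matrix_l1_norm(omega))    # largest absolute column sum
print(max_column_lq(omega, 0))  # largest number of nonzero entries in a column
```

For a symmetric matrix the $\ell_1$ and $\ell_\infty$ norms coincide, which is why only $\|\Omega\|_1$ appears in the definition of the parameter space.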

Our analysis establishes the minimax rates of convergence for estimating the precision
matrices over the parameter space $\mathcal{G}_q(c_{n,p}, M_{n,p})$ under the matrix $\ell_w$ norm losses for
$1 \le w \le \infty$. We shall first introduce a new method using an adaptive constrained $\ell_1$
minimization approach for estimating the sparse precision matrices. The estimator, called
ACLIME, is fully data-driven and easy to implement. The properties of the ACLIME are
then studied in detail under the matrix norm losses. In particular, we establish the rates
of convergence for the ACLIME estimator which provide upper bounds for the minimax
risks.

A major step in establishing the minimax rates of convergence is the derivation of rate 
sharp lower bounds. As in the case of estimating sparse covariance matrices, conventional 
lower bound techniques, which are designed and well suited for problems with parameters 
that are scalar or vector-valued, fail to yield good results for estimating sparse precision
matrices under the spectral norm. In the present paper we apply the "two-directional" 
lower bound technique first developed in Cai and Zhou (2012) for estimating sparse covari- 
ance matrices. This lower bound method can be viewed as a simultaneous application of 
Assouad's Lemma along the row direction and Le Cam's method along the column direc- 
tion. The lower bounds match the rates in the upper bounds for the ACLIME estimator 
and thus yield the minimax rates. 

By combining the minimax lower and upper bounds developed in later sections, the
main results on the optimal rates of convergence for estimating a sparse precision matrix
under various norms can be summarized in the following theorem. We focus here on the
exact sparse case of $q = 0$; the optimal rates for the general case of $0 \le q < 1$ are given at
the end of Section 4. Here for two sequences of positive numbers $a_n$ and $b_n$, $a_n \asymp b_n$ means
that there exist positive constants $c$ and $C$ independent of $n$ such that $c \le a_n/b_n \le C$.

Theorem 1. Let $X_i \overset{iid}{\sim} N_p(\mu, \Sigma)$, $i = 1, 2, \ldots, n$, and let $1 \le k = o(n^{1/2}(\log p)^{-3/2})$. The
minimax risk of estimating the precision matrix $\Omega = \Sigma^{-1}$ over the class $\mathcal{G}_0(k, M_{n,p})$ based
on the random sample $\{X_1, \ldots, X_n\}$ satisfies

$$\inf_{\hat\Omega} \sup_{\mathcal{G}_0(k, M_{n,p})} E\|\hat\Omega - \Omega\|_w^2 \asymp M_{n,p}^2 k^2 \frac{\log p}{n} \qquad (2)$$

for all $1 \le w \le \infty$.



In view of Theorem 1, the ACLIME estimator, which is fully data driven, attains the
optimal rates of convergence simultaneously for all $k$-sparse precision matrices in the
parameter spaces $\mathcal{G}_0(k, M_{n,p})$ with $k \ll n^{1/2}(\log p)^{-3/2}$ under the matrix $\ell_w$ norm for all
$1 \le w \le \infty$. As will be seen in later sections, the adaptivity holds for the general $\ell_q$ balls
$\mathcal{G}_q(c_{n,p}, M_{n,p})$ with $0 \le q < 1$. The ACLIME procedure is thus rate optimally adaptive to
both the sparsity patterns and the loss functions.

In addition to its theoretical optimality, the ACLIME estimator is computationally easy 
to implement for high dimensional data. It can be computed column by column via linear 
programming and the algorithm is easily scalable. A simulation study is carried out to 
investigate the numerical performance of the ACLIME estimator. The results show that 
the procedure performs favorably in comparison to CLIME. 

Our work on optimal estimation of precision matrices given in the present paper is closely
connected to a growing literature on estimation of large covariance matrices. Many
regularization methods have been proposed and studied. For example, Bickel and Levina (2008a,
b) proposed banding and thresholding estimators for estimating bandable and sparse
covariance matrices respectively and obtained rates of convergence for the two estimators. See
also El Karoui (2008) and Lam and Fan (2009). Cai, Zhang and Zhou (2010) established the
optimal rates of convergence for estimating bandable covariance matrices. Cai and Yuan
(2012) introduced an adaptive block thresholding estimator which is simultaneously rate
optimal over large collections of bandable covariance matrices. Cai and Zhou (2012)
obtained the minimax rate of convergence for estimating sparse covariance matrices under 
a range of losses including the spectral norm loss. In particular, a new general lower bound 
technique was developed. Cai and Liu (2011) introduced an adaptive thresholding proce- 
dure for estimating sparse covariance matrices that automatically adjusts to the variability 
of individual entries. 

The rest of the paper is organized as follows. The ACLIME estimator is introduced in
detail in Section 2 and its theoretical properties are studied in Section 3. In particular,
a minimax upper bound for estimating sparse precision matrices is obtained. Section 4
establishes a minimax lower bound which matches the minimax upper bound derived in
Section 3 in terms of the convergence rate. The upper and lower bounds together yield
the optimal minimax rate of convergence. A simulation study is carried out in Section 5
to compare the performance of the ACLIME with that of the CLIME estimator. Section
6 gives the optimal rate of convergence for estimating sparse precision matrices under the
Frobenius norm and discusses connections and differences of our work with other related
problems. The proofs are given in Section 7.

2 Methodology 

In this section we introduce an adaptive constrained $\ell_1$ minimization procedure, called
ACLIME, for estimating a precision matrix $\Omega$. The properties of the estimator are then
studied in Section 3 under the matrix $\ell_w$ norm losses for $1 \le w \le \infty$ and a minimax upper
bound is established. The upper bound together with the lower bound given in Section 4
will show that the ACLIME estimator is adaptively rate optimal.

We begin with basic notation and definitions. For a vector $a = (a_1, \ldots, a_p)^T \in \mathbb{R}^p$,
define $|a|_1 = \sum_{j=1}^p |a_j|$ and $|a|_2 = \sqrt{\sum_{j=1}^p a_j^2}$. For a matrix $A = (a_{ij}) \in \mathbb{R}^{p \times q}$, we define the
elementwise $\ell_r$ norm by $|A|_r = (\sum_{i,j} |a_{ij}|^r)^{1/r}$. The Frobenius norm of $A$ is the elementwise
$\ell_2$ norm. $I$ denotes a $p \times p$ identity matrix. For any two index sets $T$ and $T'$ and matrix
$A$, we use $A_{TT'}$ to denote the $|T| \times |T'|$ matrix with rows and columns of $A$ indexed by $T$
and $T'$ respectively.

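For an i.i.d. random sample $\{X_1, \ldots, X_n\}$ of $p$-variate observations drawn from a
population $X$, let the sample mean $\bar{X} = n^{-1} \sum_{k=1}^n X_k$ and the sample covariance matrix

$$\Sigma^* = (\sigma^*_{ij})_{1 \le i,j \le p} = \frac{1}{n-1} \sum_{l=1}^n (X_l - \bar{X})(X_l - \bar{X})^T, \qquad (3)$$

which is an unbiased estimate of the covariance matrix $\Sigma = (\sigma_{ij})_{1 \le i,j \le p}$.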
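
As a small illustration of the sample mean and the unbiased sample covariance matrix, the sketch below computes both in plain Python, dividing by $n - 1$ as unbiasedness requires. The function name and the three data points are hypothetical and serve only as an example.

```python
def sample_mean_and_covariance(xs):
    """Return the sample mean and the unbiased (divide by n - 1) sample covariance."""
    n, p = len(xs), len(xs[0])
    mean = [sum(x[j] for x in xs) / n for j in range(p)]
    cov = [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in xs) / (n - 1)
            for j in range(p)]
           for i in range(p)]
    return mean, cov


# Hypothetical data: n = 3 observations of a p = 2 dimensional vector.
data = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
mean, cov = sample_mean_and_covariance(data)
print(mean)
print(cov)
```

With $n = 3 < p$ in realistic high dimensional settings the resulting matrix is singular, which is the reason direct inversion is not an option.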

It is well known that in the high dimensional setting, the inverse of the sample
covariance matrix either does not exist or is not a good estimator of $\Omega$. As mentioned in the
introduction, a number of methods for estimating $\Omega$ have been introduced in the literature.
In particular, Cai, Liu and Luo (2011) proposed an estimator called CLIME by solving the
following optimization problem:

$$\min |\Omega|_1 \quad \text{subject to:} \quad |\Sigma^* \Omega - I|_\infty \le \tau_n, \quad \Omega \in \mathbb{R}^{p \times p}, \qquad (4)$$

where $\tau_n = C M_{n,p} \sqrt{\log p / n}$ for some constant $C$. The convex program (4) can be further
decomposed into $p$ vector-minimization problems. Let $e_i$ be a standard unit vector in $\mathbb{R}^p$
with 1 in the $i$-th coordinate and 0 in all other coordinates. For $1 \le i \le p$, let $\hat\omega_i$ be the
solution of the following convex optimization problem

$$\min |\omega|_1 \quad \text{subject to} \quad |\Sigma^* \omega - e_i|_\infty \le \tau_n, \qquad (5)$$

where $\omega$ is a vector in $\mathbb{R}^p$. The final CLIME estimator of $\Omega$ is obtained by putting the
columns $\hat\omega_i$ together and applying an additional symmetrization step. This estimator is
easy to implement and possesses a number of desirable properties as shown in Cai, Liu and
Luo (2011).
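
The column-wise problem can be written as a linear program by a standard auxiliary-variable argument. The display below is a sketch of that reformulation for the $i$-th column, writing $\Sigma^*$ for the sample covariance matrix and $\tau_n$ for the constraint level; the auxiliary variables $u_k$ are introduced here for illustration and are not notation from the paper.

```latex
\begin{aligned}
\min_{\omega,\, u \in \mathbb{R}^p} \quad & \sum_{k=1}^{p} u_k \\
\text{subject to} \quad & -u_k \le \omega_k \le u_k, && 1 \le k \le p, \\
& -\tau_n \le (\Sigma^* \omega - e_i)_k \le \tau_n, && 1 \le k \le p.
\end{aligned}
```

At an optimum $u_k = |\omega_k|$, so the objective equals $|\omega|_1$ and all constraints are linear in $(\omega, u)$.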

The CLIME estimator has, however, two drawbacks. One is that the estimator is
not rate optimal, as will be shown later. Another drawback is that the procedure is not
adaptive in the sense that the tuning parameter $\lambda_n$ is not fully specified and needs to be
chosen through an empirical method such as cross-validation.

To overcome these drawbacks of CLIME, we now introduce an adaptive constrained
$\ell_1$-minimization for inverse matrix estimation (ACLIME). The estimator is fully data-driven
and adaptive to the variability of individual entries. A key technical result which provides
the motivation for the new procedure is the following fact.

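Lemma 1. Let $X_1, \ldots, X_n \overset{iid}{\sim} N_p(\mu, \Sigma)$ with $\log p = O(n^{1/3})$. Set $S^* = (s^*_{ij})_{1 \le i,j \le p} =
\Sigma^* \Omega - I_{p \times p}$, where $\Sigma^*$ is the sample covariance matrix defined in (3). Then

$$\mathrm{Var}(s^*_{ij}) = \begin{cases} n^{-1}(1 + \sigma_{ii}\omega_{ii}), & \text{for } i = j, \\ n^{-1}\sigma_{ii}\omega_{jj}, & \text{for } i \ne j, \end{cases}$$

and for all $\delta \ge 2$,

$$P\left\{ |(\Sigma^* \Omega - I_{p \times p})_{ij}| \le \delta \sqrt{\frac{\sigma_{ii}\omega_{jj} \log p}{n}},\ \forall\, 1 \le i,j \le p \right\} \ge 1 - O\left( (\log p)^{-1/2} p^{-\delta^2/4 + 1} \right). \qquad (6)$$
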

A major step in the construction of the adaptive data-driven procedure is to make the
constraint in (4) and (5) adaptive to the variability of individual entries based on Lemma
1, instead of using a single upper bound $\lambda_n$ for all the entries. In order to apply Lemma
1, we need to estimate the diagonal elements of $\Sigma$ and $\Omega$, $\sigma_{ii}$ and $\omega_{jj}$, $i, j = 1, \ldots, p$. Note
that $\sigma_{ii}$ can be easily estimated by the sample variances $\sigma^*_{ii}$, but $\omega_{jj}$ are harder to estimate.
Hereafter, $(A)_{ij}$ denotes the $(i,j)$-th entry of the matrix $A$, $(a)_j$ denotes the $j$-th element
of the vector $a$. Denote $b_j = (b_{1j}, \ldots, b_{pj})'$.

The ACLIME procedure has two steps: The first step is to estimate $\omega_{jj}$ and the second
step is to apply a modified version of the CLIME procedure to take into account the
variability of individual entries.

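Step 1: Estimating $\omega_{jj}$. Note that $\sigma_{ii}\omega_{jj} \le (\sigma_{ii} \vee \sigma_{jj})\omega_{jj}$ and $(\sigma_{ii} \vee \sigma_{jj})\omega_{jj} \ge 1$. So the
inequality on the left hand side of (6) can be relaxed to

$$|(\Sigma^* \Omega - I_{p \times p})_{ij}| \le 2(\sigma_{ii} \vee \sigma_{jj})\,\omega_{jj} \sqrt{\frac{\log p}{n}}, \quad 1 \le i,j \le p. \qquad (7)$$

Let $\hat\Omega_1 := (\hat\omega^1_{ij}) = (\hat\omega^1_{\cdot 1}, \ldots, \hat\omega^1_{\cdot p})$ be a solution to the following optimization problem:

$$\hat\omega^1_{\cdot j} = \arg\min_{b_j \in \mathbb{R}^p} \left\{ |b_j|_1 : |(\hat\Sigma b_j - e_j)_i| \le \lambda_n (\sigma^*_{ii} \vee \sigma^*_{jj}) \times b_{jj},\ 1 \le i \le p,\ b_{jj} > 0 \right\}, \qquad (8)$$

where $b_j = (b_{1j}, \ldots, b_{pj})'$, $1 \le j \le p$, $\hat\Sigma = \Sigma^* + n^{-1} I_{p \times p}$ and

$$\lambda_n = \delta \sqrt{\frac{\log p}{n}}. \qquad (9)$$

Here $\delta$ is a constant which can be taken as 2. The estimator $\hat\Omega_1$ yields estimates of
the conditional variance $\omega_{jj}$, $1 \le j \le p$. More specifically, we define the estimates of
$\omega_{jj}$ by

$$\breve\omega_{jj} = \hat\omega^1_{jj}\, I\left\{ \sigma^*_{jj} \le \sqrt{\frac{n}{\log p}} \right\} + \sqrt{\frac{\log p}{n}}\, I\left\{ \sigma^*_{jj} > \sqrt{\frac{n}{\log p}} \right\}.$$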

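Step 2: Adaptive estimation. Given the estimates $\breve\omega_{jj}$, the final estimator $\hat\Omega$ of $\Omega$ is
constructed as follows. First we obtain $\tilde\Omega^1 =: (\tilde\omega^1_{ij})$ by solving $p$ optimization problems:
for $1 \le j \le p$

$$\tilde\omega^1_{\cdot j} = \arg\min_{b \in \mathbb{R}^p} \left\{ |b|_1 : |(\hat\Sigma b - e_j)_i| \le \lambda_n \sqrt{\sigma^*_{ii} \breve\omega_{jj}},\ 1 \le i \le p \right\}, \qquad (10)$$

where $\lambda_n$ is given in (9). We then obtain the estimator $\hat\Omega$ by symmetrizing $\tilde\Omega^1$,

$$\hat\Omega = (\hat\omega_{ij}), \quad \text{where } \hat\omega_{ij} = \hat\omega_{ji} = \tilde\omega^1_{ij}\, I\{ |\tilde\omega^1_{ij}| \le |\tilde\omega^1_{ji}| \} + \tilde\omega^1_{ji}\, I\{ |\tilde\omega^1_{ij}| > |\tilde\omega^1_{ji}| \}. \qquad (11)$$

We shall call the estimator $\hat\Omega$ adaptive CLIME, or ACLIME. The estimator adapts to the
variability of individual entries by using an entry-dependent threshold for each individual
$\omega_{ij}$. Note that the optimization problem (8) is convex and can be cast as a linear program.
The constant $\delta$ in (9) can be taken as 2 and the resulting estimator will be shown to be
adaptively minimax rate optimal for estimating sparse precision matrices.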
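
The two ingredients of Step 2 that do not require a linear programming solver can be sketched directly: the entry-dependent constraint levels $\lambda_n \sqrt{\sigma^*_{ii} \breve\omega_{jj}}$ and the final symmetrization, which keeps the entry of smaller magnitude in each pair. This is a minimal illustration under stated assumptions, not the authors' implementation; the function names and numerical inputs are hypothetical, and the linear programs themselves are not solved here.

```python
import math


def lambda_n(n, p, delta=2.0):
    """Tuning level delta * sqrt(log p / n); delta = 2 is the default suggested in the text."""
    return delta * math.sqrt(math.log(p) / n)


def step2_thresholds(sigma_star_diag, omega_breve_diag, n, p, delta=2.0):
    """thresholds[j][i] = lambda_n * sqrt(sigma*_ii * omega_breve_jj), the bound on
    |(Sigma_hat b - e_j)_i| in the j-th column problem of Step 2."""
    lam = lambda_n(n, p, delta)
    return [[lam * math.sqrt(s_ii * w_jj) for s_ii in sigma_star_diag]
            for w_jj in omega_breve_diag]


def symmetrize(omega_tilde):
    """For each pair (i, j) keep the entry of smaller magnitude and use it in both positions."""
    p = len(omega_tilde)
    out = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            a, b = omega_tilde[i][j], omega_tilde[j][i]
            out[i][j] = out[j][i] = a if abs(a) <= abs(b) else b
    return out


# Hypothetical inputs for illustration only.
levels = step2_thresholds([1.0, 4.0], [1.0, 0.25], n=200, p=100)
sym = symmetrize([[1.0, 0.5], [0.2, 2.0]])
print(levels)
print(sym)
```

Because the levels vary with both $i$ and $j$, entries with large $\sigma^*_{ii}$ or large estimated $\omega_{jj}$ receive a looser constraint than a single universal level would give them.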

Remark 1. Note that $\delta = 2$ used in the constraint sets is tight; it cannot be further
reduced in general. If one chooses the constant $\delta < 2$, then with probability tending to 1,
the true precision matrix will no longer belong to the feasible sets. To see this, consider
$\Sigma = \Omega = I_{p \times p}$ for simplicity. It follows from Liu, Lin and Shao (2008) and Cai and Jiang
(2011) that

$$\sqrt{\frac{n}{\log p}} \max_{1 \le i < j \le p} |\hat\sigma_{ij}| \to 2$$

in probability. Thus $P(|\hat\Sigma \Omega - I_{p \times p}|_\infty > \lambda_n) \to 1$, which means that if $\delta < 2$, the true $\Omega$ lies
outside of the feasible set with high probability and solving the corresponding minimization
problem cannot lead to a good estimator of $\Omega$.

Remark 2. The CLIME estimator uses a universal tuning parameter $\lambda_n = C M_{n,p} \sqrt{\log p / n}$
which does not take into account the variations in the variances $\sigma_{ii}$ and the conditional
variances $\omega_{jj}$. It will be shown that the convergence rate of CLIME obtained by Cai, Liu
and Luo (2011) is not optimal. The quantity $M_{n,p}$ is the upper bound of the matrix $\ell_1$
norm which is unknown in practice. The cross validation method can be used to choose the
tuning parameter in CLIME. However, the estimator obtained through CV can be variable
and its theoretical properties are unclear. In contrast, the ACLIME procedure proposed in
the present paper does not depend on any unknown parameters and it will be shown that
the estimator is minimax rate optimal.



3 Properties of ACLIME and Minimax Upper Bounds 

We now study the properties of the ACLIME estimator $\hat\Omega$ proposed in Section 2. We shall
begin with the Gaussian case where $X \sim N_p(\mu, \Sigma)$. Extensions to non-Gaussian distributions
will be discussed later. The following result shows that the ACLIME estimator adaptively
attains the convergence rate of

$$M_{n,p}^{1-q} c_{n,p} \left( \frac{\log p}{n} \right)^{(1-q)/2}$$

over the class of sparse precision matrices $\mathcal{G}_q(c_{n,p}, M_{n,p})$ defined in (1) under the matrix $\ell_w$
norm losses for all $1 \le w \le \infty$. The lower bound given in Section 4 shows that this rate
is indeed optimal and thus ACLIME adapts to both sparsity patterns and this class of loss
functions.
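
To get a feel for the rate $M_{n,p}^{1-q} c_{n,p} (\log p / n)^{(1-q)/2}$, the following sketch evaluates it for a few sample sizes. The function name and the parameter values are hypothetical and chosen only for illustration; the unspecified constant factor in the rate is ignored.

```python
import math


def aclime_rate(n, p, c_np, m_np, q=0.0):
    """Evaluate M^(1-q) * c * (log p / n)^((1-q)/2), ignoring constant factors."""
    return m_np ** (1.0 - q) * c_np * (math.log(p) / n) ** ((1.0 - q) / 2.0)


# Hypothetical setting: p = 100 variables, at most c = 3 nonzeros per column, M = 2.
for n in (200, 800, 3200):
    print(n, aclime_rate(n, p=100, c_np=3, m_np=2.0))
```

In the exactly sparse case $q = 0$ the rate scales as $n^{-1/2}$, so quadrupling the sample size halves it.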

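Theorem 2. Suppose we observe a random sample $X_1, \ldots, X_n \overset{iid}{\sim} N_p(\mu, \Sigma)$. Let $\Omega = \Sigma^{-1}$
be the precision matrix. Let $\delta \ge 2$, $\log p = O(n^{1/3})$ and

$$c_{n,p} = O\left( n^{(1-q)/2} / (\log p)^{(3-q)/2} \right). \qquad (12)$$

Then for some constant $C > 0$

$$\inf_{\Omega \in \mathcal{G}_q(c_{n,p}, M_{n,p})} P\left( \|\hat\Omega - \Omega\|_w \le C M_{n,p}^{1-q} c_{n,p} \left( \frac{\log p}{n} \right)^{(1-q)/2} \right) \ge 1 - O\left( (\log p)^{-1/2} p^{-\delta^2/4 + 1} \right)$$

for all $1 \le w \le \infty$.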

For $q = 0$, a sufficient condition for estimating $\Omega$ consistently under the spectral norm is

$$M_{n,p} c_{n,p} \sqrt{\frac{\log p}{n}} = o(1), \quad \text{i.e.,} \quad M_{n,p} c_{n,p} = o\left( \sqrt{\frac{n}{\log p}} \right).$$

This implies that the total number of nonzero elements on each column needs to be $\ll \sqrt{n}$ in
order for the precision matrix to be estimated consistently over $\mathcal{G}_0(c_{n,p}, M_{n,p})$. In Theorem
5 we show that the upper bound $M_{n,p} c_{n,p} \sqrt{\log p / n}$ is indeed rate optimal over $\mathcal{G}_0(c_{n,p}, M_{n,p})$.

We now consider the rate of convergence under the expectation. For technical reasons,
we require the constant $\delta \ge 3$ in this case.

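Theorem 3. Suppose we observe a random sample $X_1, \ldots, X_n \overset{iid}{\sim} N_p(\mu, \Sigma)$. Let $\Omega = \Sigma^{-1}$
be the precision matrix. Let $\log p = O(n^{1/3})$ and $\delta \ge 3$. Suppose that $p \ge n^{13/(\delta^2 - 8)}$ and

$$c_{n,p} = o\left( (n / \log p)^{\frac{1}{2} - \frac{q}{2}} \right).$$

The ACLIME estimator $\hat\Omega$ satisfies, for all $1 \le w \le \infty$ and $0 \le q < 1$,

$$\sup_{\mathcal{G}_q(c_{n,p}, M_{n,p})} E\|\hat\Omega - \Omega\|_w^2 \le C M_{n,p}^{2-2q} c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q}$$

for some constant $C > 0$.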



Theorem 3 can be extended to non-Gaussian distributions. Let $Z = (Z_1, Z_2, \ldots, Z_p)'$
be a $p$-variate random variable with mean $\mu$ and covariance matrix $\Sigma = (\sigma_{ij})_{1 \le i,j \le p}$. Let
$\Omega = (\omega_{ij})_{1 \le i,j \le p}$ be the precision matrix. Define $Y_i = (Z_i - \mu_i)/\sigma_{ii}^{1/2}$, $1 \le i \le p$, and
$(W_1, \ldots, W_p)' := \Omega(Z - \mu)$. Assume that there exist some positive constants $\eta$ and $M$ such
that for all $1 \le i \le p$,

$$E \exp(\eta Y_i^2) \le M, \quad E \exp(\eta W_i^2 / \omega_{ii}) \le M. \qquad (13)$$

Then we have the following result.

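Theorem 4. Suppose we observe an i.i.d. sample $X_1, \ldots, X_n$ with the precision matrix $\Omega$
satisfying Condition (13). Let $\log p = O(n^{1/3})$, $p \ge n^{\gamma}$ for some $\gamma > 0$. Suppose that

$$c_{n,p} = o\left( (n / \log p)^{\frac{1}{2} - \frac{q}{2}} \right).$$

Then there is a $\delta$ depending only on $\eta$, $M$ and $\gamma$ such that the ACLIME estimator $\hat\Omega$
satisfies, for all $1 \le w \le \infty$ and $0 \le q < 1$,

$$\sup_{\mathcal{G}_q(c_{n,p}, M_{n,p})} E\|\hat\Omega - \Omega\|_w^2 \le C M_{n,p}^{2-2q} c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q}$$

for some constant $C > 0$.
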

Remark 3. Under Condition (13) it can be shown that an analogous result to Lemma 1
in Section 2 holds with some $\delta$ depending only on $\eta$ and $M$. Thus, it can be proved that,
under Condition (13), Theorem 4 holds. The proof is similar to that of Theorem 3. A
practical way to choose $\delta$ is using cross validation.

Remark 4. Theorems 2, 3 and 4 follow mainly from the convergence rate under the
element-wise $\ell_\infty$ norm and the inequality $\|M\|_w \le \|M\|_1$ for any symmetric matrix $M$
from Lemma 8. The convergence rate under the element-wise norm plays an important role
in graphical model selection and in establishing the convergence rate under other matrix
norms, such as the Frobenius norm $\|\cdot\|_F$. Indeed, from the proof, Theorems 2, 3 and 4
hold under the matrix $\ell_1$ norm. More specifically, under the conditions of Theorems 3 and
4 we have

$$\sup_{\mathcal{G}_q(c_{n,p}, M_{n,p})} E|\hat\Omega - \Omega|_\infty^2 \le C M_{n,p}^2 \frac{\log p}{n},$$

$$\sup_{\mathcal{G}_q(c_{n,p}, M_{n,p})} E\|\hat\Omega - \Omega\|_1^2 \le C M_{n,p}^{2-2q} c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q},$$

$$\sup_{\mathcal{G}_q(c_{n,p}, M_{n,p})} \frac{1}{p} E\|\hat\Omega - \Omega\|_F^2 \le C M_{n,p}^{2-q} c_{n,p} \left( \frac{\log p}{n} \right)^{1-q/2}.$$


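Remark 5. The results in this section can be easily extended to the weak $\ell_q$ ball with
$0 \le q < 1$ to model the sparsity of the precision matrix $\Omega$. A weak $\ell_q$ ball of radius $c$ in $\mathbb{R}^p$
is defined as follows,

$$B_q(c) = \left\{ \xi \in \mathbb{R}^p : |\xi|_{(k)}^q \le c k^{-1}, \text{ for all } k = 1, \ldots, p \right\},$$

where $|\xi|_{(1)} \ge |\xi|_{(2)} \ge \cdots \ge |\xi|_{(p)}$. Let

$$\mathcal{G}^*_q(c_{n,p}, M_{n,p}) = \left\{ \Omega = (\omega_{ij})_{1 \le i,j \le p} : \omega_{\cdot j} \in B_q(c_{n,p}),\ \|\Omega\|_1 \le M_{n,p},\ \lambda_{\max}(\Omega)/\lambda_{\min}(\Omega) \le M_1,\ \Omega \succ 0 \right\}.$$

Theorems 2, 3 and 4 hold with the parameter space $\mathcal{G}_q(c_{n,p}, M_{n,p})$ replaced by $\mathcal{G}^*_q(c_{n,p}, M_{n,p})$
by a slight extension of Lemma 7 from the $\ell_q$ ball to the weak $\ell_q$ ball, similar to Equation
(51) in Cai and Zhou (2012).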
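
The weak $\ell_q$ ball condition bounds the $k$-th largest magnitude of a vector by $(c/k)^{1/q}$. The following minimal sketch, with hypothetical function names and restricted to $0 < q < 1$, checks membership in the weak ball and in the ordinary $\ell_q$ ball for comparison.

```python
def in_weak_lq_ball(xi, q, c):
    """Check |xi|_(k)^q <= c / k for every k, with magnitudes sorted in decreasing order.
    Intended for 0 < q < 1."""
    magnitudes = sorted((abs(x) for x in xi), reverse=True)
    return all(m ** q <= c / k for k, m in enumerate(magnitudes, start=1))


def in_lq_ball(xi, q, c):
    """Check sum_i |xi_i|^q <= c, skipping zero entries."""
    return sum(abs(x) ** q for x in xi if x != 0) <= c


# Hypothetical vector for illustration only.
xi = [1.0, 0.25, 0.25]
print(in_weak_lq_ball(xi, q=0.5, c=1.5))
print(in_lq_ball(xi, q=0.5, c=1.5))
```

Since $k\,|\xi|_{(k)}^q \le \sum_i |\xi_i|^q$, every vector in the $\ell_q$ ball of radius $c$ also lies in the weak $\ell_q$ ball of the same radius, while the example shows that the converse can fail.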



4 Minimax Lower Bounds 

Theorem 3 shows that the ACLIME estimator adaptively attains the rate of convergence

$$M_{n,p}^{2-2q} c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q} \qquad (15)$$

under the squared matrix $\ell_w$ norm loss for $1 \le w \le \infty$ over the collection of the parameter
spaces $\mathcal{G}_q(c_{n,p}, M_{n,p})$. In this section we shall show that the rate of convergence given in
(15) cannot be improved by any other estimator and thus is indeed optimal among all
estimators by establishing minimax lower bounds for estimating sparse precision matrices
under the squared matrix $\ell_w$ norm.

Theorem 5. Let $X_1, \ldots, X_n \overset{iid}{\sim} N_p(\mu, \Sigma)$ with $p > c_1 n^{\beta}$ for some constants $\beta > 1$ and
$c_1 > 0$. Assume that

$$c M_{n,p}^{q} \left( \frac{\log p}{n} \right)^{q/2} \le c_{n,p} = o\left( M_{n,p}^{q} n^{(1-q)/2} (\log p)^{-(3-q)/2} \right) \qquad (16)$$

for some constant $c > 0$. The minimax risk for estimating the precision matrix $\Omega = \Sigma^{-1}$
over the parameter space $\mathcal{G}_q(c_{n,p}, M_{n,p})$ under the condition (16) satisfies

$$\inf_{\hat\Omega} \sup_{\mathcal{G}_q(c_{n,p}, M_{n,p})} E\|\hat\Omega - \Omega\|_w^2 \ge C M_{n,p}^{2-2q} c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q}$$

for some constant $C > 0$ and for all $1 \le w \le \infty$.

The proof of Theorem 5 is involved. We shall discuss the key technical tools and outline
the important steps in the proof of Theorem 5 in this section. The detailed proof is given
in Section 7.

4.1 A General Technical Tool 

We use a lower bound technique introduced in Cai and Zhou (2012), which is particularly
well suited for treating "two-directional" problems such as matrix estimation. The
technique can be viewed as a generalization of both Le Cam's method and Assouad's Lemma,
two classical lower bound arguments. Let $X$ be an observation from a distribution $P_\theta$ where
$\theta$ belongs to a parameter set $\Theta$ which has a special tensor structure. For a given positive
integer $r$ and a finite set $B \subset \mathbb{R}^p \setminus \{0_{1 \times p}\}$, let $\Gamma = \{0, 1\}^r$ and $\Lambda \subseteq B^r$. Define

$$\Theta = \Gamma \otimes \Lambda = \{ (\gamma, \lambda) : \gamma \in \Gamma \text{ and } \lambda \in \Lambda \}. \qquad (17)$$

In comparison, the standard lower bound arguments work with either $\Gamma$ or $\Lambda$ alone. For
example, Assouad's Lemma considers only the parameter set $\Gamma$ and Le Cam's
method typically applies to a parameter set like $\Lambda$ with $r = 1$. Cai and Zhou (2012) gives a
lower bound for the maximum risk over the parameter set $\Theta$ for the problem of estimating
a functional $\psi(\theta)$, belonging to a metric space with metric $d$.

We need to introduce some notation before formally stating the lower bound. For
two distributions $P$ and $Q$ with densities $p$ and $q$ with respect to any common dominating
measure $\mu$, the total variation affinity is given by $\|P \wedge Q\| = \int p \wedge q \, d\mu$. For a parameter
$\gamma = (\gamma_1, \ldots, \gamma_r) \in \Gamma$ where $\gamma_i \in \{0, 1\}$, define

$$H(\gamma, \gamma') = \sum_{i=1}^{r} |\gamma_i - \gamma'_i| \qquad (18)$$

to be the Hamming distance on $\{0, 1\}^r$.

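
The Hamming distance on $\{0,1\}^r$ simply counts the coordinates in which two binary vectors differ. A minimal sketch, with hypothetical names and a small $r$ chosen for illustration:

```python
from itertools import product


def hamming(gamma, gamma_prime):
    """Number of coordinates in which two 0/1 vectors of equal length differ."""
    return sum(abs(g - h) for g, h in zip(gamma, gamma_prime))


# Enumerate Gamma = {0, 1}^r for a small illustrative r.
r = 3
Gamma = list(product((0, 1), repeat=r))
neighbours = [g for g in Gamma if hamming(g, (0, 0, 0)) == 1]
print(len(Gamma), neighbours)
```

Each element of $\Gamma$ has exactly $r$ neighbours at Hamming distance 1, which is the structure Assouad-type arguments exploit.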
Let $D_\Lambda = \mathrm{Card}(\Lambda)$. For a given $a \in \{0, 1\}$ and $1 \le i \le r$, we define the mixture
distribution $\bar{P}_{a,i}$ by

$$\bar{P}_{a,i} = \frac{1}{2^{r-1} D_\Lambda} \sum_{\theta} \{ P_\theta : \gamma_i(\theta) = a \}. \qquad (19)$$

So $\bar{P}_{a,i}$ is the mixture distribution over all $P_\theta$ with $\gamma_i(\theta)$ fixed to be $a$ while all other
components of $\theta$ vary over all possible values. In our construction of the parameter set
for establishing the minimax lower bound, $r$ is the number of possibly non-zero rows in
the upper triangle of the covariance matrix, and $\Lambda$ is the set of matrices with $r$ rows to
determine the upper triangle matrix.

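Lemma 2. For any estimator $T$ of $\psi(\theta)$ based on an observation from the experiment
$\{P_\theta, \theta \in \Theta\}$, and any $s > 0$

$$\max_{\Theta} 2^s E_\theta d^s(T, \psi(\theta)) \ge \alpha \frac{r}{2} \min_{1 \le i \le r} \|\bar{P}_{0,i} \wedge \bar{P}_{1,i}\| \qquad (20)$$

where $\bar{P}_{a,i}$ is defined in Equation (19) and $\alpha$ is given by

$$\alpha = \min_{\{(\theta, \theta') : H(\gamma(\theta), \gamma(\theta')) \ge 1\}} \frac{d^s(\psi(\theta), \psi(\theta'))}{H(\gamma(\theta), \gamma(\theta'))}. \qquad (21)$$
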

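We introduce some new notation to study the affinity $\|\bar{P}_{0,i} \wedge \bar{P}_{1,i}\|$ in Equation (20).
Denote the projection of $\theta \in \Theta$ to $\Gamma$ by $\gamma(\theta) = (\gamma_i(\theta))_{1 \le i \le r}$ and to $\Lambda$ by $\lambda(\theta) =
(\lambda_i(\theta))_{1 \le i \le r}$. More generally we define $\gamma_A(\theta) = (\gamma_i(\theta))_{i \in A}$ for a subset $A \subseteq \{1, 2, \ldots, r\}$,
a projection of $\theta$ to a subset of $\Gamma$. A particularly useful example of set $A$ is

$$\{-i\} = \{1, \ldots, i-1, i+1, \ldots, r\},$$

for which $\gamma_{-i}(\theta) = (\gamma_1(\theta), \ldots, \gamma_{i-1}(\theta), \gamma_{i+1}(\theta), \ldots, \gamma_r(\theta))$. $\lambda_A(\theta)$ and $\lambda_{-i}(\theta)$ are defined
similarly. We denote the set $\{\lambda_A(\theta) : \theta \in \Theta\}$ by $\Lambda_A$. For $a \in \{0, 1\}$, $b \in \{0, 1\}^{r-1}$, and
$c \in \Lambda_{-i} \subseteq B^{r-1}$, let

$$D_{\Lambda_i}(a, b, c) = \mathrm{Card}\{ \gamma \in \Lambda : \gamma_i(\theta) = a,\ \gamma_{-i}(\theta) = b \text{ and } \lambda_{-i}(\theta) = c \}$$

and define

$$\bar{P}_{(a,i,b,c)} = \frac{1}{D_{\Lambda_i}(a, b, c)} \sum_{\theta} \{ P_\theta : \gamma_i(\theta) = a,\ \gamma_{-i}(\theta) = b \text{ and } \lambda_{-i}(\theta) = c \}. \qquad (22)$$

In other words, $\bar{P}_{(a,i,b,c)}$ is the mixture distribution over all $P_\theta$ with $\lambda_i(\theta)$ varying over all
possible values while all other components of $\theta$ remain fixed.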

The following lemma gives a lower bound for the affinity in Equation (20). See Section
2 of Cai and Zhou (2012) for more details.

Lemma 3. Let $\bar{P}_{a,i}$ and $\bar{P}_{(a,i,b,c)}$ be defined in Equation (19) and (22) respectively, then

$$\|\bar{P}_{0,i} \wedge \bar{P}_{1,i}\| \ge \mathop{\mathrm{Average}}_{\gamma_{-i}, \lambda_{-i}} \|\bar{P}_{(0,i,\gamma_{-i},\lambda_{-i})} \wedge \bar{P}_{(1,i,\gamma_{-i},\lambda_{-i})}\|,$$

where the average over $\gamma_{-i}$ and $\lambda_{-i}$ is induced by the uniform distribution over $\Theta$.

4.2 Lower Bound for Estimating Sparse Precision Matrix

We now apply the lower bound technique developed in Section 4.1 to establish rate sharp
results under the matrix $\ell_w$ norm. Let $X_1, \ldots, X_n \overset{iid}{\sim} N_p(\mu, \Omega^{-1})$ with $p > c_1 n^{\beta}$ for some
$\beta > 1$ and $c_1 > 0$, where $\Omega \in \mathcal{G}_q(c_{n,p}, M_{n,p})$. The proof of Theorem 5 contains four
major steps. We first reduce the minimax lower bound under the general matrix $\ell_w$ norm,
$1 \le w \le \infty$, to that under the spectral norm. In the second step we construct in detail a
subset $\mathcal{F}_*$ of the parameter space $\mathcal{G}_q(c_{n,p}, M_{n,p})$ such that the difficulty of estimation over
$\mathcal{F}_*$ is essentially the same as that of estimation over $\mathcal{G}_q(c_{n,p}, M_{n,p})$. The third step is the
application of Lemma 2 to the carefully constructed parameter set, and finally in the fourth
step we calculate the factor $\alpha$ defined in (21) and the total variation affinity between two
multivariate normal mixtures. We outline the main ideas of the proof here and leave detailed
proofs of some technical results to Section 7.

Proof of Theorem 5. We shall divide the proof into four major steps.

Step 1: Reducing the general problem to the lower bound under the spectral
norm. The following lemma implies that the minimax lower bound under the spectral
norm yields a lower bound under the general matrix $\ell_w$ norm up to a constant factor 4.

Lemma 4. Let $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, \Omega^{-1})$, and let $\mathcal{F}$ be any parameter space of precision
matrices. The minimax risk for estimating the precision matrix $\Omega$ over $\mathcal{F}$ satisfies

$$\inf_{\hat\Omega} \sup_{\mathcal{F}} E\|\hat\Omega - \Omega\|_w^2 \ge \frac{1}{4} \inf_{\hat\Omega} \sup_{\mathcal{F}} E\|\hat\Omega - \Omega\|_2^2 \qquad (23)$$

for all $1 \le w \le \infty$.




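Step 2: Constructing the parameter set. Let $r = \lceil p/2 \rceil$ and let $B$ be the collection of
all vectors $(b_j)_{1 \le j \le p}$ such that $b_j = 0$ for $1 \le j \le p - r$ and $b_j = 0$ or 1 for $p - r + 1 \le j \le p$
under the constraint $\|b\|_0 = k$ (to be defined later). For each $b \in B$ and each $1 \le m \le r$,
define a $p \times p$ matrix $\lambda_m(b)$ by making the $m$th row of $\lambda_m(b)$ equal to $b$ and the rest of the
entries 0. It is clear that $\mathrm{Card}(B) = \binom{r}{k}$. Set $\Gamma = \{0, 1\}^r$. Note that each component $b_i$ of
$\lambda = (b_1, \ldots, b_r) \in \Lambda$ can be uniquely associated with a $p \times p$ matrix $\lambda_i(b_i)$. $\Lambda$ is the set of all
matrices $\lambda$ with every column sum less than or equal to $2k$. Define $\Theta = \Gamma \otimes \Lambda$ and let
$\epsilon_{n,p} \in \mathbb{R}$ be fixed. (The exact value of $\epsilon_{n,p}$ will be chosen later.) For each $\theta = (\gamma, \lambda) \in \Theta$
with $\gamma = (\gamma_1, \ldots, \gamma_r)$ and $\lambda = (b_1, \ldots, b_r)$, we associate $\theta$ with a precision matrix $\Omega(\theta)$ by

$$\Omega(\theta) = \frac{M_{n,p}}{2} \left[ I_p + \epsilon_{n,p} \sum_{m=1}^{r} \gamma_m \lambda_m(b_m) \right].$$

Finally we define a collection $\mathcal{F}_*$ of precision matrices as

$$\mathcal{F}_* = \left\{ \Omega(\theta) : \Omega(\theta) = \frac{M_{n,p}}{2} \left[ I_p + \epsilon_{n,p} \sum_{m=1}^{r} \gamma_m \lambda_m(b_m) \right],\ \theta = (\gamma, \lambda) \in \Theta \right\}.$$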
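
The construction of $\Omega(\theta)$ can be illustrated on a tiny example. The sketch below is hedged in two ways: it takes the scaling factor in front of the bracket to be $M_{n,p}/2$, and it mirrors each row $b_m$ into the $m$th column so that the resulting matrix is symmetric, which is an assumption made here for illustration. All function names and numerical values are hypothetical.

```python
def lambda_m(p, m, b):
    """p x p matrix with the 0/1 vector b in row m, mirrored into column m
    (the mirroring is an assumption made so that the result is symmetric)."""
    mat = [[0.0] * p for _ in range(p)]
    for j, b_j in enumerate(b):
        if b_j:
            mat[m][j] = 1.0
            mat[j][m] = 1.0
    return mat


def omega_theta(p, gamma, bs, eps, m_np):
    """(M / 2) * [I_p + eps * sum_m gamma_m * lambda_m(b_m)]."""
    out = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for m, (g, b) in enumerate(zip(gamma, bs)):
        if g:
            lam = lambda_m(p, m, b)
            for i in range(p):
                for j in range(p):
                    out[i][j] += eps * lam[i][j]
    return [[0.5 * m_np * x for x in row] for row in out]


# Illustrative values: p = 6, r = 3, k = 1; each b_m is supported on the last r coordinates.
p = 6
gamma = (1, 0, 1)
bs = [(0, 0, 0, 1, 0, 0), (0, 0, 0, 0, 1, 0), (0, 0, 0, 0, 0, 1)]
omega = omega_theta(p, gamma, bs, eps=0.1, m_np=2.0)

# Strict diagonal dominance holds here because the off-diagonal row sums are small.
dominant = all(omega[i][i] > sum(abs(omega[i][j]) for j in range(p) if j != i)
               for i in range(p))
print(dominant)
```

Rows with $\gamma_m = 0$ contribute nothing, so flipping one coordinate of $\gamma$ switches a single sparse row perturbation on or off.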
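We now specify the values of $\epsilon_{n,p}$ and $k$. Set

$$\epsilon_{n,p} = \upsilon \sqrt{\frac{\log p}{n}}, \quad \text{for some sufficiently small constant } \upsilon > 0, \qquad (24)$$

and

$$k = \left\lceil 2^{-1} c_{n,p} (M_{n,p} \epsilon_{n,p})^{-q} \right\rceil - 1, \qquad (25)$$

which is at least 1 from Equation (24). Now we show $\mathcal{F}_*$ is a subset of the parameter space
$\mathcal{G}_q(c_{n,p}, M_{n,p})$. From the definition of $k$ in (25) note that

$$\max_j \sum_{i \ne j} |\omega_{ij}|^q \le 2 \cdot 2^{-1} c_{n,p} (M_{n,p} \epsilon_{n,p})^{-q} \left( \frac{M_{n,p}}{2} \epsilon_{n,p} \right)^q \le c_{n,p}. \qquad (26)$$

From Equation (16) we have $c_{n,p} = o\left( M_{n,p}^{q} n^{(1-q)/2} (\log p)^{-(3-q)/2} \right)$, which implies

$$2k\epsilon_{n,p} \le c_{n,p} \epsilon_{n,p}^{1-q} M_{n,p}^{-q} = o(1/\log p), \qquad (27)$$

then

$$\max_j \sum_i |\omega_{ij}| \le \frac{M_{n,p}}{2} (1 + 2k\epsilon_{n,p}) \le M_{n,p}. \qquad (28)$$

Since $\|A\|_2 \le \|A\|_1$ for a symmetric matrix $A$, we have

$$\left\| \epsilon_{n,p} \sum_{m=1}^{r} \gamma_m \lambda_m(b_m) \right\|_2 \le \left\| \epsilon_{n,p} \sum_{m=1}^{r} \gamma_m \lambda_m(b_m) \right\|_1 \le 2k\epsilon_{n,p} = o(1),$$

which implies that every $\Omega(\theta)$ is diagonally dominant and positive definite, and

$$\lambda_{\max}(\Omega) \le \frac{M_{n,p}}{2} (1 + 2k\epsilon_{n,p}), \quad \text{and} \quad \lambda_{\min}(\Omega) \ge \frac{M_{n,p}}{2} (1 - 2k\epsilon_{n,p}), \qquad (29)$$

which immediately implies

$$\frac{\lambda_{\max}(\Omega)}{\lambda_{\min}(\Omega)} < M_1. \qquad (30)$$

Equations (26), (28), and (30) all together imply $\mathcal{F}_* \subset \mathcal{G}_q(c_{n,p}, M_{n,p})$.
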


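Step 3: Applying the general lower bound argument. Let $X_1, \ldots, X_n \overset{iid}{\sim} N_p(0, (\Omega(\theta))^{-1})$
with $\theta \in \Theta$ and denote the joint distribution by $P_\theta$. Applying Lemmas 2 and 3 to the
parameter space $\Theta$, we have

$$\inf_{\hat\Omega} \max_{\theta \in \Theta} 2^2 E_\theta \|\hat\Omega - \Omega(\theta)\|_2^2 \ge \alpha \cdot \frac{r}{2} \cdot \min_i \mathop{\mathrm{Average}}_{\gamma_{-i}, \lambda_{-i}} \|\bar{P}_{(0,i,\gamma_{-i},\lambda_{-i})} \wedge \bar{P}_{(1,i,\gamma_{-i},\lambda_{-i})}\| \qquad (31)$$

where

$$\alpha = \min_{\{(\theta, \theta') : H(\gamma(\theta), \gamma(\theta')) \ge 1\}} \frac{\|\Omega(\theta) - \Omega(\theta')\|_2^2}{H(\gamma(\theta), \gamma(\theta'))} \qquad (32)$$

and $\bar{P}_{0,i}$ and $\bar{P}_{1,i}$ are defined as in (19).

Step 4: Bounding the per comparison loss $\alpha$ defined in (32) and the affinity
$\min_i \mathrm{Average}_{\gamma_{-i}, \lambda_{-i}} \|\bar{P}_{(0,i,\gamma_{-i},\lambda_{-i})} \wedge \bar{P}_{(1,i,\gamma_{-i},\lambda_{-i})}\|$ in (31). This is done separately in the next two
lemmas, which are proved in detail in Section 7.

Lemma 5. The per comparison loss $\alpha$ defined in (32) satisfies

$$\alpha \ge \frac{(M_{n,p} k \epsilon_{n,p})^2}{4p}.$$

Lemma 6. Let $X_1, \ldots, X_n \overset{iid}{\sim} N(0, (\Omega(\theta))^{-1})$ with $\theta \in \Theta$ and denote the joint distribution
by $P_\theta$. For $a \in \{0, 1\}$ and $1 \le i \le r$, define $\bar{P}_{(a,i,b,c)}$ as in (22). Then there exists a constant
$c_1 > 0$ such that

$$\min_i \mathop{\mathrm{Average}}_{\gamma_{-i}, \lambda_{-i}} \|\bar{P}_{(0,i,\gamma_{-i},\lambda_{-i})} \wedge \bar{P}_{(1,i,\gamma_{-i},\lambda_{-i})}\| \ge c_1.$$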

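Finally, the minimax lower bound for estimating a sparse precision matrix over the
collection $\mathcal{G}_q(c_{n,p}, M_{n,p})$ is obtained by putting together (31) and Lemmas 5 and 6,

$$\inf_{\hat\Omega} \sup_{\mathcal{G}_q(c_{n,p}, M_{n,p})} E \left\| \hat\Omega - \Omega \right\|_2^2 \ge \inf_{\hat\Omega} \max_{\theta \in \Theta} E_\theta \left\| \hat\Omega - \Omega(\theta) \right\|_2^2 \ge \frac{(M_{n,p} k \epsilon_{n,p})^2}{4p} \cdot \frac{p}{16} \cdot c_1 \ge c_2 M_{n,p}^{2-2q} c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q},$$

for some constant $c_2 > 0$. □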
Putting together the minimax upper and lower bounds in Theorems 3 and 5 as well
as Remark 5 yields the optimal rates of convergence for estimating $\Omega$ over the collection
of the $\ell_q$ balls $\mathcal{G}_q(c_{n,p}, M_{n,p})$ defined in (1) as well as the collection of the weak $\ell_q$ balls
$\mathcal{G}_q^*(c_{n,p}, M_{n,p})$.



Theorem 6. Suppose we observe a random sample $X_i \overset{iid}{\sim} N_p(\mu, \Sigma)$, $i = 1, 2, \ldots, n$. Let

$7 = S"-*^ be the precision matrix. Assume that logp = 0{n^^^) and 



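$$c M_{n,p}^q \left( \frac{\log p}{n} \right)^{q/2} \le c_{n,p} = o\left( M_{n,p}^q n^{(1-q)/2} (\log p)^{-(3-q)/2} \right) \tag{33}$$

for some constant $c > 0$. Then

$$\inf_{\hat\Omega} \sup_{\Omega \in \mathcal{G}} E \left\| \hat\Omega - \Omega \right\|_w^2 \asymp M_{n,p}^{2-2q} c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q} \tag{34}$$

for all $1 \le w \le \infty$, where $\mathcal{G} = \mathcal{G}_q(c_{n,p}, M_{n,p})$ or $\mathcal{G}_q^*(c_{n,p}, M_{n,p})$.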
5 Numerical results

In this section, we consider the numerical performance of ACLIME. In particular, we shall
compare the performance of ACLIME with that of CLIME. The following three graphical
models are considered. Let $D = \mathrm{diag}(U_1, \ldots, U_p)$, where $U_i$, $1 \le i \le p$, are i.i.d. uniform
random variables on the interval (1, 5). Let $\Sigma = \Omega^{-1} = D \Omega_1^{-1} D$. The matrix $D$ makes the
diagonal entries in $\Sigma$ and $\Omega$ different.

• Band graph. Let $\Omega_1 = (\omega_{ij})$, where $\omega_{ii} = 1$, $\omega_{i,i+1} = \omega_{i+1,i} = 0.6$, $\omega_{i,i+2} = \omega_{i+2,i} = 0.3$,
$\omega_{ij} = 0$ for $|i - j| \ge 3$.

• AR(1) model. Let $\Omega_1 = (\omega_{ij})$, where $\omega_{ij} = (0.6)^{|i-j|}$.

• Erdős-Rényi random graph. Let $\Omega_2 = (\omega_{ij})$, where $\omega_{ij} = u_{ij} * \delta_{ij}$, $\delta_{ij}$ is a
Bernoulli random variable with success probability 0.05 and $u_{ij}$ is a uniform random
variable with distribution $U(0.4, 0.8)$. We let $\Omega_1 = \Omega_2 + (|\min(\lambda_{\min})| + 0.05) I_p$. It is
easy to check that the matrix $\Omega_1$ is symmetric and positive definite.
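The three precision matrices above can be generated as follows. This is a minimal NumPy sketch under stated assumptions: the function names are illustrative, the Erdős-Rényi matrix is symmetrized by drawing only the strict upper triangle and mirroring it (the text does not say how symmetry is enforced), and $\lambda_{\min}$ is taken to be the smallest eigenvalue of $\Omega_2$.

```python
import numpy as np

def band_precision(p):
    """Band graph: 1 on the diagonal, 0.6 and 0.3 on the first two off-diagonals."""
    omega = np.eye(p)
    omega += 0.6 * (np.eye(p, k=1) + np.eye(p, k=-1))
    omega += 0.3 * (np.eye(p, k=2) + np.eye(p, k=-2))
    return omega

def ar1_precision(p, rho=0.6):
    """AR(1) model: omega_ij = rho^{|i-j|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def erdos_renyi_precision(p, rng, prob=0.05):
    """Sparse random matrix shifted by (|lambda_min| + 0.05) I_p.
    Assumption: only the strict upper triangle is drawn, then mirrored."""
    u = rng.uniform(0.4, 0.8, size=(p, p))
    delta = rng.random((p, p)) < prob
    upper = np.triu(u * delta, k=1)
    omega2 = upper + upper.T
    lam_min = np.linalg.eigvalsh(omega2).min()
    return omega2 + (abs(lam_min) + 0.05) * np.eye(p)

def rescale(omega1, rng):
    """Return (Sigma, Omega) with Sigma = D Omega_1^{-1} D, D = diag(U_i), U_i ~ U(1, 5)."""
    d = rng.uniform(1.0, 5.0, size=omega1.shape[0])
    sigma = d[:, None] * np.linalg.inv(omega1) * d[None, :]
    omega = omega1 / d[:, None] / d[None, :]      # D^{-1} Omega_1 D^{-1}
    return sigma, omega

rng = np.random.default_rng(0)
p, n = 50, 200
models = {
    "band": band_precision(p),
    "ar1": ar1_precision(p),
    "er": erdos_renyi_precision(p, rng),
}
sigma, omega = rescale(models["band"], rng)
X = rng.multivariate_normal(np.zeros(p), sigma, size=n)   # n training samples
```

The rescaling by $D$ is applied to the band model here only as an example; the same call works for the other two matrices.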

We generate $n = 200$ random training samples from the $N_p(0, \Sigma)$ distribution for $p =
50, 100, 200$. For ACLIME, we set $\delta = 2$ in Step 1 and choose $\delta$ in Step 2 by a cross
validation method. To this end, we generate an additional 200 testing samples. The tuning
parameter in CLIME is selected by cross validation. Note that ACLIME chooses different
tuning parameters for different columns and CLIME chooses a universal tuning parameter.
The log-likelihood loss

$$L(\Sigma_1, \Omega) = \langle \Sigma_1, \Omega \rangle - \log(\det(\Omega)),$$

where $\Sigma_1$ is the sample covariance matrix of the testing samples, is used in the cross
validation method. For $\delta$ in (9), we let $\delta = \delta_j = j/50$, $1 \le j \le 100$. For each $\delta_j$, the ACLIME
estimator $\hat\Omega(\delta_j)$ is obtained and the tuning parameter $\delta$ in (9) is selected by minimizing the
log-likelihood loss,

$$\hat\delta = \hat{j}/50, \quad \text{where} \quad \hat{j} = \arg\min_{1 \le j \le 100} L(\Sigma_1, \hat\Omega(\delta_j)).$$

The tuning parameter $\lambda_n$ in CLIME is also selected by cross validation. The detailed steps
can be found in Cai, Liu and Luo (2011).
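The grid search described above can be sketched as follows. ACLIME itself is not implemented here: `toy_estimator` is a hypothetical stand-in (the inverse of a ridge-regularized training covariance) used only so that the selection loop runs end to end. The loss is written as the negative Gaussian log-likelihood up to constants, so that smaller values are better; this sign convention is an assumption of the sketch.

```python
import numpy as np

def likelihood_loss(sigma_test, omega):
    """<Sigma_1, Omega> - log det(Omega); smaller is better."""
    sign, logdet = np.linalg.slogdet(omega)
    if sign <= 0:
        return np.inf                   # reject estimates that are not positive definite
    return np.sum(sigma_test * omega) - logdet

def toy_estimator(sigma_train, delta):
    """Hypothetical stand-in for ACLIME: inverse of a ridge-regularized covariance."""
    p = sigma_train.shape[0]
    return np.linalg.inv(sigma_train + delta * np.eye(p))

def select_delta(sigma_train, sigma_test, estimator, grid):
    """Return the grid value minimizing the loss on the testing covariance."""
    losses = [likelihood_loss(sigma_test, estimator(sigma_train, d)) for d in grid]
    return grid[int(np.argmin(losses))], losses

rng = np.random.default_rng(1)
p, n = 20, 200
train = rng.standard_normal((n, p))     # training samples
test = rng.standard_normal((n, p))      # additional testing samples
sigma_train = np.cov(train, rowvar=False)
sigma_test = np.cov(test, rowvar=False)
grid = np.arange(1, 101) / 50.0         # delta_j = j / 50, 1 <= j <= 100
delta_hat, losses = select_delta(sigma_train, sigma_test, toy_estimator, grid)
```

Any estimator with the signature `estimator(sigma_train, delta)` can be passed to `select_delta` in place of the stand-in.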

The empirical errors of the ACLIME and CLIME estimators under various settings are
summarized in Table 1 below. Three losses under the spectral norm, matrix $\ell_1$ norm and
Frobenius norm are given to compare the performance between ACLIME and CLIME. As
can be seen from Table 1, ACLIME, which is tuning-free, outperforms CLIME in most of
the cases for each of the three graphs.
                        ACLIME                                  CLIME
p             50          100         200            50          100         200

Spectral norm
Band      0.30(0.01)  0.45(0.01)  0.65(0.01)     0.32(0.01)  0.50(0.01)  0.72(0.01)
AR(1)     0.75(0.01)  1.04(0.01)  1.25(0.01)     0.73(0.01)  1.05(0.01)  1.30(0.01)
E-R       0.65(0.03)  0.95(0.02)  2.62(0.02)     0.72(0.03)  1.21(0.04)  2.28(0.02)

Matrix $\ell_1$ norm
Band      0.62(0.02)  0.79(0.01)  0.94(0.01)     0.65(0.02)  0.86(0.02)  0.99(0.01)
AR(1)     1.19(0.02)  1.62(0.02)  1.93(0.01)     1.17(0.01)  1.59(0.01)  1.89(0.01)
E-R       1.47(0.08)  2.15(0.06)  5.47(0.05)     1.53(0.06)  2.34(0.06)  5.20(0.04)

Frobenius norm
Band      0.80(0.01)  1.61(0.02)  3.11(0.02)     0.83(0.01)  1.73(0.02)  3.29(0.03)
AR(1)     1.47(0.02)  2.73(0.01)  4.72(0.01)     1.47(0.02)  2.82(0.02)  4.97(0.01)
E-R       1.53(0.05)  3.15(0.03)  9.89(0.07)     1.62(0.04)  3.61(0.05)  8.86(0.04)



Table 1: Comparisons of ACLIME and CLIME for the three graphical models under three 
matrix norm losses. Inside the parentheses are the standard deviations of the empirical 
errors over 100 replications. 
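The three losses reported in Table 1 can be computed with NumPy as sketched below. The function name `matrix_losses` and the small example matrices are illustrative; the matrix $\ell_1$ norm is the maximum absolute column sum.

```python
import numpy as np

def matrix_losses(omega_hat, omega):
    """Spectral, matrix l1 (max absolute column sum) and Frobenius norms of the error."""
    diff = omega_hat - omega
    return {
        "spectral": np.linalg.norm(diff, 2),
        "l1": np.linalg.norm(diff, 1),
        "frobenius": np.linalg.norm(diff, "fro"),
    }

omega = np.eye(3)
omega_hat = np.array([[1.0, 0.1, 0.0],
                      [0.1, 1.0, 0.0],
                      [0.0, 0.0, 1.2]])
losses = matrix_losses(omega_hat, omega)
```

Averaging these quantities over independent replications gives entries comparable in form to those in the table.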
6 Discussions

We established in this paper the optimal rates of convergence and introduced an adaptive
method for estimating sparse precision matrices under the matrix $\ell_w$ norm losses for $1 \le
w \le \infty$. The minimax rate of convergence under the Frobenius norm loss can also be easily
established. As seen in the proof of Theorems 2 and 3, with probability tending to one,

$$|\hat\Omega - \Omega|_\infty \le C M_{n,p} \sqrt{\frac{\log p}{n}}, \tag{35}$$

for some constant $C > 0$. From Equation (35) one can immediately obtain the following
risk upper bound under the Frobenius norm, which can be shown to be rate optimal using
a similar proof to that of Theorem 5.

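Theorem 7. Suppose we observe a random sample $X_i \overset{iid}{\sim} N_p(\mu, \Sigma)$, $i = 1, 2, \ldots, n$. Let $\Omega =
\Sigma^{-1}$ be the precision matrix. Under the assumption (33), the minimax risk of estimating
the precision matrix $\Omega$ over the class $\mathcal{G}_q(c_{n,p}, M_{n,p})$ defined in (1) satisfies

$$\inf_{\hat\Omega} \sup_{\mathcal{G}_q(c_{n,p}, M_{n,p})} E \frac{1}{p} \left\| \hat\Omega - \Omega \right\|_F^2 \asymp M_{n,p}^{2-q} c_{n,p} \left( \frac{\log p}{n} \right)^{1-q/2}.$$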



As shown in Theorem 6, the optimal rate of convergence for estimating sparse precision
matrices under the squared $\ell_w$ norm loss is $M_{n,p}^{2-2q} c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q}$. It is interesting to compare
this with the minimax rate of convergence for estimating sparse covariance matrices under
the same loss, which is $c_{n,p}^2 \left( \frac{\log p}{n} \right)^{1-q}$ (cf. Theorem 1 in Cai and Zhou (2012)). These
two convergence rates are similar, but have an important distinction. The difficulty of
estimating a sparse covariance matrix does not depend on the $\ell_1$ norm bound $M_{n,p}$, while
the difficulty of estimating a sparse precision matrix does.
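To make the comparison concrete, the two rates can be evaluated side by side. The sketch below simply codes the two expressions, $M_{n,p}^{2-2q} c_{n,p}^2 (\log p / n)^{1-q}$ and $c_{n,p}^2 (\log p / n)^{1-q}$, ignoring constants; the function names and the numerical values of $n$, $p$, $q$, $c_{n,p}$ and $M_{n,p}$ are illustrative assumptions.

```python
import math

def precision_rate(n, p, q, c, M):
    """M^{2-2q} c^2 (log p / n)^{1-q}: rate for sparse precision matrices."""
    return M ** (2 - 2 * q) * c ** 2 * (math.log(p) / n) ** (1 - q)

def covariance_rate(n, p, q, c):
    """c^2 (log p / n)^{1-q}: rate for sparse covariance matrices."""
    return c ** 2 * (math.log(p) / n) ** (1 - q)

n, p, q, c = 200, 100, 0.5, 2.0
ratios = {M: precision_rate(n, p, q, c, M) / covariance_rate(n, p, q, c)
          for M in (1.0, 2.0, 4.0)}
```

The ratio of the two rates equals $M_{n,p}^{2-2q}$, so with $q = 0.5$ it grows linearly in $M_{n,p}$.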

As mentioned in the introduction, an important problem related to the estimation of the
precision matrix is the recovery of a Gaussian graph, which is equivalent to the estimation
of the support of $\Omega$. Let $G = (V, E)$ be an undirected graph representing the conditional
independence relations between the components of a random vector $X$. The vertex set $V$
contains the components of $X$, $V = X = \{V_1, \ldots, V_p\}$. The edge set $E$ consists of ordered
pairs $(i, j)$, indicating conditional dependence between the components $V_i$ and $V_j$. An edge
between $V_i$ and $V_j$ is in the set $E$, i.e., $(i, j) \in E$, if and only if $\omega_{ij} \ne 0$. The adaptive CLIME
estimator, with an additional thresholding step, can recover the support of $\Omega$. Define the
estimator of the support of $\Omega$ by

$$\mathrm{SUPP}(\hat\Omega) = \{(i, j) : |\hat\omega_{ij}| \ge \tau_{ij}\},$$

where the choice of $\tau_{ij}$ depends on the bound on $|\hat\omega_{ij} - \omega_{ij}|$. Equation (35) implies that the
right threshold levels are $\tau_{ij} = C M_{n,p} \sqrt{\log p / n}$. If the magnitudes of the nonzero entries
exceed $2 C M_{n,p} \sqrt{\log p / n}$, then $\mathrm{SUPP}(\hat\Omega)$ recovers the support of $\Omega$ exactly. In the context of
covariance matrix estimation, Cai and Liu (2011) introduced an adaptive entry-dependent
thresholding procedure to recover the support of $\Sigma$. That method is based on the sharp
bound



where $\hat\theta_{ij}$ is an estimator of $\mathrm{Var}((X_i - \mu_i)(X_j - \mu_j))$. It is natural to ask whether one can
use data and entry-dependent threshold levels $\tau_{ij}$ to recover the support of $\Omega$. It is clear
that the optimal choice of $\tau_{ij}$ depends on the sharp bounds for $|\hat\omega_{ij} - \omega_{ij}|$, which are much
more difficult to establish than in the covariance matrix case.
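The thresholding rule for support recovery can be sketched as follows. The estimate `omega_hat` is a hypothetical perturbation of the truth whose entrywise error stays below the threshold, standing in for ACLIME, and `tau` plays the role of $C M_{n,p} \sqrt{\log p / n}$ with all constants chosen for illustration only.

```python
import numpy as np

def estimate_support(omega_hat, tau):
    """Boolean matrix marking entries with |omega_hat_ij| >= tau."""
    return np.abs(omega_hat) >= tau

rng = np.random.default_rng(2)
p, tau = 6, 0.1
# True precision matrix: tridiagonal, every nonzero entry has magnitude > 2 * tau.
omega = np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))
# Hypothetical estimate whose entrywise error is strictly below tau.
noise = rng.uniform(-0.09, 0.09, size=(p, p))
omega_hat = omega + (noise + noise.T) / 2
support_hat = estimate_support(omega_hat, tau)
support_true = omega != 0
```

Because every nonzero entry exceeds $2\tau$ and the entrywise error is below $\tau$, the estimated support coincides with the true one, which is the argument made after Equation (35).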

Several recent papers considered the estimation of nonparanormal graphical models
where the population distribution is non-Gaussian; see Xue and Zou (2012) and Liu et al.
(2012). The nonparanormal model assumes that the variables follow a joint normal
distribution after a set of unknown marginal monotone transformations. Xue and Zou (2012)
estimated the nonparanormal model by applying CLIME (and graphical lasso, neighborhood
Dantzig selector) to the adjusted Spearman's rank correlations. ACLIME can also be
used in such a setting. It would be interesting to investigate the properties of the resulting
estimator under the nonparanormal model. Detailed analysis is involved and we leave this
as future work.

7 Proofs 

In this section we prove the main results, Theorems 2 and 3, and the key technical results,
Lemmas 4, 5 and 6, used in the proof of Theorem 5. The proof of Lemma 6 is involved.
We begin by proving Lemma 1 stated in Section 2 and collecting a few additional technical
lemmas that will be used in the proofs of the main results.

7.1 Proof of Lemma [1] and Additional Technical Lemmas 

Proof of Lemma [1] Let S = (fjjj) = Yl^Zi ^k^'k- Note that S* has the same 
distribution as that of S with ~ iV(0,S). So we can replace S* in Section 2 by 



tn = t + n~^Ipy,p and assume Xk ~ iV(0, S). Let ^„ = 1 - O ( (logi9)-^/2p-5V4+i j ^nd 




set A„ = 5y/logp/n + 0((n logp) ^/'^). It suffices to prove that with probability greater 
than An, 




(36) 
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Note that Cov(X^O) = Q, \/ar{Xf^u).j) = cojj and Cov(XjfcjX^a;.j) = X^^^i crik^kj = for 
i 7^ j. So Xki and Xf^uj.j are independent. Hence, E(XfejX^(x!.j)'^ = 0. By Theorem 5.23 
and (5.77) in Petrov (1995), we have 



n-l 



k=l 



= (1 + o(l))P (|Af(0, 1)1 > < C(logij)-i/V''/2_ (37) 

We next prove the second inequahty in ([36|) . We have E(XfcjX^a;.j) = 1 and Var(XfcjX^a;.j) = 
CTjjOJjj + 1. Note that Eexp(io(^fcj^^<^ j)^/(l + (^jj^jj) < cq for some absolute constants 
to and cq. By Theorem 5.23 in Petrov (1995), 



n-l 



^ Xi^jX'^uj.j -n + 1 >SJ {ajjUjj + 1) logpj < C{logpy^/^p 



-1/2-52/2 



(38) 



k=l 

Since 1 = E,{XkjXf^io.j) < E.^^'^iXkjXf^io.j)'^ < cj^^t^j^^, we have crjjLOjj > 1. This, together 
with §7^ and yields (l36]l . 

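Lemma 7. Let $\hat\Omega$ be any estimator of $\Omega$ and set $t_n = |\hat\Omega - \Omega|_\infty$. Then on the event

$$\{ |\hat\omega_{\cdot j}|_1 \le |\omega_{\cdot j}|_1, \ \text{for } 1 \le j \le p \},$$

we have

$$\left\| \hat\Omega - \Omega \right\|_1 \le 12 c_{n,p} t_n^{1-q}. \tag{39}$$

Proof. Define

$$h_j = \hat\omega_{\cdot j} - \omega_{\cdot j}, \quad h_j^1 = \left( \hat\omega_{ij} I\{|\hat\omega_{ij}| \ge 2t_n\}; 1 \le i \le p \right)^T - \omega_{\cdot j}, \quad h_j^2 = h_j - h_j^1.$$

Then

$$|\omega_{\cdot j}|_1 - |h_j^1|_1 + |h_j^2|_1 \le |\omega_{\cdot j} + h_j^1|_1 + |h_j^2|_1 = |\hat\omega_{\cdot j}|_1 \le |\omega_{\cdot j}|_1,$$

which implies that $|h_j^2|_1 \le |h_j^1|_1$. It follows that $|h_j|_1 \le 2|h_j^1|_1$. So we only need to upper
bound $|h_j^1|_1$. We have

$$|h_j^1|_1 \le \sum_{i=1}^p |\hat\omega_{ij} - \omega_{ij}| I\{|\hat\omega_{ij}| \ge 2t_n\} + \sum_{i=1}^p |\omega_{ij}| I\{|\hat\omega_{ij}| < 2t_n\}$$

$$\le \sum_{i=1}^p t_n I\{|\omega_{ij}| \ge t_n\} + \sum_{i=1}^p |\omega_{ij}| I\{|\omega_{ij}| < 3t_n\} \le 4 c_{n,p} t_n^{1-q}.$$

So (39) holds. □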
The following lemma is a classical result. It implies that, if we only consider estimators
of symmetric matrices, an upper bound under the matrix $\ell_1$ norm is an upper bound for
the general matrix $\ell_w$ norm for all $1 \le w \le \infty$, and a lower bound under the matrix $\ell_2$
norm is also a lower bound for the general matrix $\ell_w$ norm. We give a proof of this lemma
to be self-contained.

Lemma 8. Let $A$ be a symmetric matrix, then

$$\|A\|_2 \le \|A\|_w \le \|A\|_1$$

for all $1 \le w \le \infty$.

Proof of Lemma 8. The Riesz-Thorin Interpolation Theorem (see, e.g., Thorin, 1948)
implies

$$\|A\|_w \le \max\{\|A\|_{w_1}, \|A\|_{w_2}\}, \quad \text{for all } 1 \le w_1 < w < w_2 \le \infty. \tag{40}$$

Set $w_1 = 1$ and $w_2 = \infty$, then Equation (40) yields $\|A\|_w \le \max\{\|A\|_1, \|A\|_\infty\}$ for all
$1 \le w \le \infty$. When $A$ is symmetric, we know $\|A\|_1 = \|A\|_\infty$, then immediately we have
$\|A\|_w \le \|A\|_1$. Since 2 is sandwiched between $w$ and $\frac{w}{w-1}$, and $\|A\|_w = \|A\|_{\frac{w}{w-1}}$ by duality,
from Equation (40) we have $\|A\|_2 \le \|A\|_w$ for all $1 \le w \le \infty$ when $A$ is symmetric. □
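The endpoints of Lemma 8 are easy to check numerically. NumPy only provides the induced norms for $w \in \{1, 2, \infty\}$, so this sketch verifies $\|A\|_2 \le \|A\|_1 = \|A\|_\infty$ for a random symmetric matrix rather than the full range of $w$. The test matrix is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((8, 8))
A = (B + B.T) / 2                        # symmetric test matrix

norm_1 = np.linalg.norm(A, 1)            # max absolute column sum
norm_inf = np.linalg.norm(A, np.inf)     # max absolute row sum
norm_2 = np.linalg.norm(A, 2)            # spectral norm
```

For a non-symmetric matrix the equality $\|A\|_1 = \|A\|_\infty$ generally fails, which is why the lemma is stated for symmetric matrices.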

7.2 Proof of Theorems 2 and 3

We first prove Theorem 2. From Lemma 8 it is enough to consider the $w = 1$ case. By
Lemma 1, we have with probability greater than $A_n$,



^ r^ur, II , oiioii -1 r'^^P 
< G sZi hW h 2 i2 1 max(T,-o maxw- -a/ . (41) 

n j j V n 



We first assume that maxjijJij > CSyTogp/n. By the above inequality, 

n " i -- j ojjj V n 



max 



^ _ 1 



< (-^Cn,p max u}-- \ h 6M Cn,p max ui-- max 



with probability greater than An- Because Amax(^)/Amin(^) < ^1, we have maxiU)--'^ < 
2(n/ log p)''/^. Thus by the conditions in Theorems [3] and [21 we have 



max 

i 



^ _ 1 



0(1), under conditions of Theorem 3 
O(l/(logp)), under conditions of Theorem 2 



with probability greater than An. By (j36p . we can see that, under conditions of Theorem 
2, Q belongs to the feasible set in (|10p with probability greater than An- Under conditions 
of Theorem 3, O belongs to the feasible set in pup with probability greater than 1 — 
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O ^(logp) ^^'^p <5^/4+i+o(i)^ _ ^y. ^ giniilar argument as in ()4ip . we can get — ^\oo < 
and 



CM, 



1^ ^|oo ^ CMn,p 



logp 



n 



By Lemma [71 we see that 



\(l-9)/2 



We consider the case maxjWjj < O-Sy^logp/n. Under this setting, we have mini<j<j cj*j > 
\/n/ log p with probabihty greater than A^. Hence uja = y^logp/n > cjjj and be- 
longs to the feasible set in (jlOp with probability greater than So ||r2||i < ||^^||i < 
Ccn,p(logp/n)(i-5)/2^ This proves Theorem [2j 

To prove Theorem[31 note that < ||f^^||i < HS""*^!!! < np^/"^. We have 



E 



logp 

n 
logp 



n 



l-q 



+ C{n'p + Mlf'^ci\p 



-<52/4+l+o(l) 



(logp) 



-1/2 



This proves Theorem [3j 



□ 



7.3 Proof of Lemma [4] 

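We first show that the minimax lower bound over all possible estimators is of the same
order as the minimax lower bound over only estimators of symmetric matrices under each matrix
$\ell_w$ norm. For each estimator $\hat\Omega$, we define a projection of $\hat\Omega$ to the parameter space $\mathcal{F}$,

$$\hat\Omega_{\mathrm{project}} = \arg\min_{\Omega \in \mathcal{F}} \left\| \hat\Omega - \Omega \right\|_w,$$

which is symmetric, then

$$\sup_{\mathcal{F}} E \left\| \hat\Omega_{\mathrm{project}} - \Omega \right\|_w^2 \le \sup_{\mathcal{F}} E \left[ \left\| \hat\Omega_{\mathrm{project}} - \hat\Omega \right\|_w + \left\| \hat\Omega - \Omega \right\|_w \right]^2$$

$$\le \sup_{\mathcal{F}} E \left[ \left\| \hat\Omega - \Omega \right\|_w + \left\| \hat\Omega - \Omega \right\|_w \right]^2 = 4 \sup_{\mathcal{F}} E \left\| \hat\Omega - \Omega \right\|_w^2, \tag{42}$$

where the first inequality follows from the triangle inequality and the second one follows
from the definition of $\hat\Omega_{\mathrm{project}}$. Since Equation (42) holds for every $\hat\Omega$, we have

$$\inf_{\hat\Omega, \ \mathrm{symmetric}} \sup_{\mathcal{F}} E \left\| \hat\Omega - \Omega \right\|_w^2 \le 4 \inf_{\hat\Omega} \sup_{\mathcal{F}} E \left\| \hat\Omega - \Omega \right\|_w^2.$$

From Lemma 8, we have

$$\inf_{\hat\Omega, \ \mathrm{symmetric}} \sup_{\mathcal{F}} E \left\| \hat\Omega - \Omega \right\|_w^2 \ge \inf_{\hat\Omega, \ \mathrm{symmetric}} \sup_{\mathcal{F}} E \left\| \hat\Omega - \Omega \right\|_2^2 \ge \inf_{\hat\Omega} \sup_{\mathcal{F}} E \left\| \hat\Omega - \Omega \right\|_2^2,$$

which, together with Equation (42), establishes Lemma 4. □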



7.4 Proof of Lemma [5] 

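Let $v = (v_i)$ be a column vector with length $p$, and

$$v_i = \begin{cases} 1, & p - \lceil p/2 \rceil + 1 \le i \le p \\ 0, & \text{otherwise} \end{cases}$$

i.e., $v = (1\{p - \lceil p/2 \rceil + 1 \le i \le p\})_{p \times 1}$. Set

$$w = (w_i) = [\Omega(\theta) - \Omega(\theta')] v.$$

Note that for each $i$, if $|\gamma_i(\theta) - \gamma_i(\theta')| = 1$, we have $|w_i| = \frac{M_{n,p}}{2} k \epsilon_{n,p}$. Then there are at
least $H(\gamma(\theta), \gamma(\theta'))$ elements $w_i$ with $|w_i| = \frac{M_{n,p}}{2} k \epsilon_{n,p}$, which implies

$$\left\| [\Omega(\theta) - \Omega(\theta')] v \right\|_2^2 \ge H(\gamma(\theta), \gamma(\theta')) \cdot \left( \frac{M_{n,p}}{2} k \epsilon_{n,p} \right)^2.$$

Since $\|v\|_2^2 = \lceil p/2 \rceil \le p$, the equation above yields

$$\left\| \Omega(\theta) - \Omega(\theta') \right\|^2 \ge \frac{\left\| [\Omega(\theta) - \Omega(\theta')] v \right\|_2^2}{\|v\|_2^2} \ge \frac{H(\gamma(\theta), \gamma(\theta')) \cdot \left( \frac{M_{n,p}}{2} k \epsilon_{n,p} \right)^2}{p},$$

i.e.,

$$\frac{\left\| \Omega(\theta) - \Omega(\theta') \right\|^2}{H(\gamma(\theta), \gamma(\theta'))} \ge \frac{(M_{n,p} k \epsilon_{n,p})^2}{4p}$$

when $H(\gamma(\theta), \gamma(\theta')) \ge 1$. □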
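The test-vector argument in this proof can be illustrated numerically. The sketch below is an assumption-laden toy: it normalizes $M_{n,p}/2 = 1$, uses a hypothetical helper `sym_row_matrix` for a symmetric matrix carrying ones in one row and column, and picks three rows (with supports in the last $\lceil p/2 \rceil$ columns) where $\gamma$ and $\gamma'$ are taken to differ. With $M_{n,p} = 2$ the bound $(M_{n,p} k \epsilon_{n,p})^2 / (4p)$ becomes $(k\epsilon_{n,p})^2 / p$.

```python
import numpy as np

def sym_row_matrix(p, m, cols):
    """Symmetric matrix whose m-th row and column carry ones at `cols` (0-based)."""
    a = np.zeros((p, p))
    a[m, cols] = 1.0
    a[cols, m] = 1.0
    return a

p, k, eps = 12, 2, 0.1
r = (p + 1) // 2                                 # r = ceil(p / 2)
flipped = {0: [7, 9], 2: [8, 11], 3: [6, 7]}     # rows where gamma differs, with supports
D = eps * sum(sym_row_matrix(p, m, cols) for m, cols in flipped.items())
H = len(flipped)                                 # Hamming distance between gamma and gamma'

v = np.zeros(p)
v[p - r:] = 1.0                                  # indicator of the last ceil(p/2) coordinates
ratio = np.linalg.norm(D @ v) ** 2 / (v @ v)     # ||D v||^2 / ||v||^2
spectral_sq = np.linalg.norm(D, 2) ** 2          # squared spectral norm of the difference
lower = H * (k * eps) ** 2 / p                   # H times the per-comparison bound
```

Here `D` stands for $\Omega(\theta) - \Omega(\theta')$, and the chain `spectral_sq >= ratio >= lower` mirrors the two inequalities in the proof.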
7.5 Proof of Lemma [6] 

Without loss of generality we assume that $M_{n,p}$ is a constant, since the total variation
affinity is scale invariant. The proof of the bound for the affinity given in Lemma 6 is
involved. We break the proof into a few major technical lemmas. Without loss of generality
we consider only the case $i = 1$ and prove that there exists a constant $c_2 > 0$ such that
$\|\bar{P}_{1,0} \wedge \bar{P}_{1,1}\| \ge c_2$. The following lemma turns the problem of bounding the total variation
affinity into a chi-square distance calculation on Gaussian mixtures. Define

$$\Theta_{-1} = \{(b, c) : \text{there exists a } \theta \in \Theta \text{ such that } \gamma_{-1}(\theta) = b \text{ and } \lambda_{-1}(\theta) = c\},$$

which is the set of all values that the upper triangular part of $\Omega(\theta)$ could possibly take, with
the first row left out.
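Lemma 9. If there is a constant $c_2 < 1$ such that

$$\operatorname*{Average}_{(\gamma_{-1}, \lambda_{-1}) \in \Theta_{-1}} \left\{ \int \left( \frac{d\bar{P}_{(1,1,\gamma_{-1},\lambda_{-1})}}{d\bar{P}_{(1,0,\gamma_{-1},\lambda_{-1})}} \right)^2 d\bar{P}_{(1,0,\gamma_{-1},\lambda_{-1})} - 1 \right\} \le c_2^2, \tag{43}$$

then $\|\bar{P}_{1,0} \wedge \bar{P}_{1,1}\| \ge 1 - c_2 > 0$.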

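From the definition of $\bar{P}_{(1,0,\gamma_{-1},\lambda_{-1})}$ in Equation (22) and $\theta$ in Equation (17), $\gamma_1 = 0$
implies that $\bar{P}_{(1,0,\gamma_{-1},\lambda_{-1})}$ is a single multivariate normal distribution with a precision matrix,

$$\Omega_0 = \begin{pmatrix} 1 & 0_{1 \times (p-1)} \\ 0_{(p-1) \times 1} & S_{(p-1) \times (p-1)} \end{pmatrix} \tag{44}$$

where $S_{(p-1) \times (p-1)} = (s_{ij})_{2 \le i,j \le p}$ is uniquely determined by $(\gamma_{-1}, \lambda_{-1}) = ((\gamma_2, \ldots, \gamma_r), (\lambda_2, \ldots, \lambda_r))$
with

$$s_{ij} = \begin{cases} 1, & i = j \\ \epsilon_{n,p}, & \gamma_i = \lambda_i(j) = 1 \\ 0, & \text{otherwise.} \end{cases}$$

Let

$$\Lambda_1(c) = \{a : \text{there exists a } \theta \in \Theta \text{ such that } \lambda_1(\theta) = a \text{ and } \lambda_{-1}(\theta) = c\},$$

which gives the set of all possible values of the first row with the rest of the rows given, i.e., $\lambda_{-1}(\theta) =
c$, and define $p_{\lambda_{-1}} = \mathrm{Card}(\Lambda_1(\lambda_{-1}))$, the cardinality of all possible $\lambda_1$ such that $(\lambda_1, \lambda_{-1}) \in
\Lambda$ for the given $\lambda_{-1}$. Then from the definitions in Equations (22) and (17), $\bar{P}_{(1,1,\gamma_{-1},\lambda_{-1})}$ is an
average of $\binom{p_{\lambda_{-1}}}{k}$ multivariate normal distributions with precision matrices of the following
form

$$\begin{pmatrix} 1 & r_{1 \times (p-1)} \\ r_{(p-1) \times 1} & S_{(p-1) \times (p-1)} \end{pmatrix} \tag{45}$$

where $\|r\|_0 = k$ with nonzero elements of $r$ equal to $\epsilon_{n,p}$ and the submatrix $S_{(p-1) \times (p-1)}$ is
the same as the one for $\Omega_0$ given in (44). It is helpful to observe that $p_{\lambda_{-1}} \ge p/4$. Let
$n_{\lambda_{-1}}$ be the number of columns of $\lambda_{-1}$ with column sum equal to $2k$, for which the first
row has no choice but to take value 0 in this column. Then we have $p_{\lambda_{-1}} = \lceil p/2 \rceil - n_{\lambda_{-1}}$.
Since $n_{\lambda_{-1}} \cdot 2k \le \lceil p/2 \rceil \cdot k$, the total number of 1's in the upper triangular matrix by the
construction of the parameter set, we thus have $n_{\lambda_{-1}} \le \lceil p/2 \rceil / 2$, which immediately implies
$p_{\lambda_{-1}} = \lceil p/2 \rceil - n_{\lambda_{-1}} \ge \lceil p/2 \rceil / 2 \ge p/4$.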

With Lemma 9 in place, it remains to establish Equation (43) in order to prove Lemma
6. The following lemma is useful for calculating the cross product terms in the chi-square
distance between Gaussian mixtures. The proof of the lemma is straightforward and is thus
omitted.

Lemma 10. Let $g_i$ be the density function of $N(0, \Omega_i^{-1})$ for $i = 0, 1$ and 2. Then

$$\int \frac{g_1 g_2}{g_0} = \frac{\det(I)}{\left[ \det\left( I - \Omega_1^{-1} \Omega_2^{-1} (\Omega_2 - \Omega_0)(\Omega_1 - \Omega_0) \right) \right]^{1/2}}.$$

Let $\Omega_i$, $i = 1$ or 2, be two precision matrices of the form (45). Note that $\Omega_i$, $i = 0, 1$
or 2, differ from each other only in the first row/column. Then $\Omega_i - \Omega_0$, $i = 1$ or 2, has
a very simple structure. The nonzero elements only appear in the first row/column, and
in total there are $2k$ nonzero elements. This property immediately implies the following
lemma, which makes the problem of studying the determinant in Lemma 10 relatively easy.



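Lemma 11. Let $\Omega_i$, $i = 1$ and 2, be the precision matrices of the form (45). Define $J$ to
be the number of overlapping $\epsilon_{n,p}$'s between $\Omega_1$ and $\Omega_2$ on the first row, and

$$Q = (q_{ij})_{1 \le i,j \le p} = (\Omega_2 - \Omega_0)(\Omega_1 - \Omega_0).$$

There are index subsets $I_r$ and $I_c$ in $\{2, \ldots, p\}$ with $\mathrm{Card}(I_r) = \mathrm{Card}(I_c) = k$ and
$\mathrm{Card}(I_r \cap I_c) = J$ such that

$$q_{ij} = \begin{cases} J \epsilon_{n,p}^2, & i = j = 1 \\ \epsilon_{n,p}^2, & i \in I_r \text{ and } j \in I_c \\ 0, & \text{otherwise} \end{cases}$$

and the matrix $(\Omega_2 - \Omega_0)(\Omega_1 - \Omega_0)$ has rank 2 with two identical nonzero eigenvalues $J \epsilon_{n,p}^2$.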
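The structure described in Lemma 11 can be reproduced on a small example. In this sketch the helper `first_row_perturbation` is a hypothetical name, the supports and the value of $\epsilon_{n,p}$ are arbitrary, and the scale $M_{n,p}/2$ is normalized to 1.

```python
import numpy as np

def first_row_perturbation(p, support, eps):
    """Symmetric p x p matrix whose only nonzero entries are eps in the
    first row/column at the given column indices (0-based, all >= 1)."""
    d = np.zeros((p, p))
    d[0, support] = eps
    d[support, 0] = eps
    return d

p, k, eps = 10, 3, 0.2
supp_1 = [5, 6, 7]                       # support of the first row of Omega_1 - Omega_0
supp_2 = [6, 7, 8]                       # support of the first row of Omega_2 - Omega_0
J = len(set(supp_1) & set(supp_2))       # number of overlapping entries

d1 = first_row_perturbation(p, supp_1, eps)
d2 = first_row_perturbation(p, supp_2, eps)
Q = d2 @ d1                              # (Omega_2 - Omega_0)(Omega_1 - Omega_0)
eigvals = np.linalg.eigvals(Q)
nonzero = np.sort(eigvals.real[np.abs(eigvals) > 1e-6])
rank = np.linalg.matrix_rank(Q)
```

With $J = 2$ overlapping entries, `Q[0, 0]` and both nonzero eigenvalues equal $J\epsilon_{n,p}^2 = 0.08$, and the remaining nonzero entries of `Q` equal $\epsilon_{n,p}^2$ on the block indexed by the two supports.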



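Let

$$R_{\lambda_1, \lambda_1'}^{\gamma_{-1}, \lambda_{-1}} = -\log \det\left( I - \Omega_1^{-1} \Omega_2^{-1} (\Omega_2 - \Omega_0)(\Omega_1 - \Omega_0) \right) \tag{46}$$

where $\Omega_0$ is defined in (44) and determined by $(\gamma_{-1}, \lambda_{-1})$, and $\Omega_1$ and $\Omega_2$ have the first row
$\lambda_1$ and $\lambda_1'$ respectively. We drop the indices $\lambda_1$, $\lambda_1'$ and $(\gamma_{-1}, \lambda_{-1})$ from $\Omega_i$ to simplify the
notations. Define

$$\Theta_{-1}(a_1, a_2) = \{0, 1\}^{r-1} \otimes \{c \in \Lambda_{-1} : \text{there exist } \theta_i \in \Theta, \ i = 1 \text{ and } 2, \text{ such that } \lambda_1(\theta_i) = a_i, \ \lambda_{-1}(\theta_i) = c\}.$$

It is a subset of $\Theta_{-1}$ in which the element can pick both $a_1$ and $a_2$ as the first row to form
parameters in $\Theta$. From Lemma 10 the left hand side of Equation (43) can be written as

$$\operatorname*{Average}_{(\gamma_{-1}, \lambda_{-1}) \in \Theta_{-1}} \left\{ \operatorname*{Average}_{\lambda_1, \lambda_1' \in \Lambda_1(\lambda_{-1})} \left[ \exp\left( \frac{n}{2} \cdot R_{\lambda_1, \lambda_1'}^{\gamma_{-1}, \lambda_{-1}} \right) - 1 \right] \right\}$$

$$= \operatorname*{Average}_{\lambda_1, \lambda_1' \in B} \left\{ \operatorname*{Average}_{(\gamma_{-1}, \lambda_{-1}) \in \Theta_{-1}(\lambda_1, \lambda_1')} \left[ \exp\left( \frac{n}{2} \cdot R_{\lambda_1, \lambda_1'}^{\gamma_{-1}, \lambda_{-1}} \right) - 1 \right] \right\}.$$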
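The following result shows that $R_{\lambda_1, \lambda_1'}^{\gamma_{-1}, \lambda_{-1}}$ is approximately $-\log \det\left( I - (\Omega_2 - \Omega_0)(\Omega_1 - \Omega_0) \right)$,
which is equal to $-2 \log(1 - J \epsilon_{n,p}^2)$ from Lemma 11. Define

$$\Lambda_{1,J} = \{(\lambda_1, \lambda_1') \in \Lambda_1 \otimes \Lambda_1 : \text{the number of overlapping } \epsilon_{n,p}\text{'s between } \lambda_1 \text{ and } \lambda_1' \text{ is } J\}.$$

Lemma 12. For $R_{\lambda_1, \lambda_1'}^{\gamma_{-1}, \lambda_{-1}}$ defined in Equation (46) we have

$$R_{\lambda_1, \lambda_1'}^{\gamma_{-1}, \lambda_{-1}} = -2 \log(1 - J \epsilon_{n,p}^2) + R_{1, \lambda_1, \lambda_1'}^{\gamma_{-1}, \lambda_{-1}} \tag{47}$$

where $R_{1, \lambda_1, \lambda_1'}^{\gamma_{-1}, \lambda_{-1}}$ satisfies

$$\operatorname*{Average}_{(\lambda_1, \lambda_1') \in \Lambda_{1,J}} \left[ \operatorname*{Average}_{(\gamma_{-1}, \lambda_{-1}) \in \Theta_{-1}(\lambda_1, \lambda_1')} \exp\left( \frac{n}{2} R_{1, \lambda_1, \lambda_1'}^{\gamma_{-1}, \lambda_{-1}} \right) \right] = 1 + o(1), \tag{48}$$

where $J$ is defined in Lemma 11.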



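7.5.1 Proof of Equation (43)

We are now ready to establish Equation (43), which is the key step in proving Lemma 6. It
follows from Equation (47) in Lemma 12 that

$$\operatorname*{Average}_{\lambda_1, \lambda_1' \in B} \left\{ \operatorname*{Average}_{(\gamma_{-1}, \lambda_{-1}) \in \Theta_{-1}(\lambda_1, \lambda_1')} \left[ \exp\left( \frac{n}{2} R_{\lambda_1, \lambda_1'}^{\gamma_{-1}, \lambda_{-1}} \right) - 1 \right] \right\}$$

$$= \operatorname*{Average}_{J} \left\{ \exp\left[ -n \log(1 - J \epsilon_{n,p}^2) \right] \cdot \operatorname*{Average}_{(\lambda_1, \lambda_1') \in \Lambda_{1,J}} \left[ \operatorname*{Average}_{(\gamma_{-1}, \lambda_{-1}) \in \Theta_{-1}(\lambda_1, \lambda_1')} \exp\left( \frac{n}{2} R_{1, \lambda_1, \lambda_1'}^{\gamma_{-1}, \lambda_{-1}} \right) \right] - 1 \right\}.$$

Recall that $J$ is the number of overlapping $\epsilon_{n,p}$'s between $\Omega_1$ and $\Omega_2$ on the first row. It
can be shown that $J$ has the hypergeometric distribution with

$$P(\text{number of overlapping } \epsilon_{n,p}\text{'s} = j) = \binom{k}{j} \binom{p_{\lambda_{-1}} - k}{k - j} \Big/ \binom{p_{\lambda_{-1}}}{k} \le \left( \frac{k^2}{p_{\lambda_{-1}} - k} \right)^j. \tag{49}$$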
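The hypergeometric probabilities and the bound $(k^2 / (p_{\lambda_{-1}} - k))^j$ in (49) can be compared directly for small parameter values. In the sketch below, `m` plays the role of $p_{\lambda_{-1}}$, and the values of `k` and `m` are illustrative assumptions.

```python
from math import comb

def overlap_pmf(j, k, m):
    """P(J = j) when two size-k subsets of m free columns are drawn uniformly."""
    return comb(k, j) * comb(m - k, k - j) / comb(m, k)

def overlap_bound(j, k, m):
    """Upper bound (k^2 / (m - k))^j on P(J = j)."""
    return (k * k / (m - k)) ** j

k, m = 5, 50
pmf = [overlap_pmf(j, k, m) for j in range(k + 1)]
bounds = [overlap_bound(j, k, m) for j in range(k + 1)]
```

The bound decays geometrically in $j$ when $k^2$ is small relative to $p_{\lambda_{-1}} - k$, which is what makes the sum over $j$ in the subsequent display controllable.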
Equation (49) and Lemma 12, together with Equation (24), imply

2 



Average < 

(7_i,A_i)Ge-i 



C^IP(l,l,7-i,A_i) 



dl 



(l,0,7-i,A_i) 



C^{l,0,7-i,A_i) - 1 



< 



< (1 + 0(1)) (p"^') 'exp[2j (u^logp)] +0(1) 

i>i 

< c5:(p^-^"y\o(i)<ci, 



(50) 



26 



where the last step follows from v'^ < ^J-, and Jt^ < [cn,p{Mn,pen.p) '^Y = O 



^ log^ p J 



from Equations (f25|l and (fTBll and the condition p > cin^ for some /3 > 1, and C2 



logp 

is a positive constant. | 

7.6 Proof of Lemma [12] 

Let 

A={i- n^^n^^) - no) (Oi - no) [i - {n2 - no) (ni - no)r^ . (5i) 

Since ||r2j — JIqII < ll^^i — ^^olli < '^ken,p = o(l/logp) from Equation (p7|) . it is easy to see 
that 

\\A\\=0{ken,p) = o{l). (52) 

Define 

Rl\''\-' = -logdet {I- A). 
^ as follows 



Then we can rewrite R 



o7-i.->»-l 



- log det (/ - n-^n:^^ {n2 - no) {ni - no)) 

- log det {[I -A] -[I- {n2 - no) (f^i - no)]) 

- log det [/ - (f]2 - no) (Oi - no)] - log det (/ - A) 



(53) 



where the last equation follows from Lemma [TTJ To establish Lemma [T2] it is enough to 
establish Equation (jiS]) . 
Let 



Bi 

B2 



ibi,ij) 



pxp 



i-{ni-no + 1)-^ {n2 -no + 1)'^ 



(^2,*j)pxp = - no) {ni - no) [i - (f^2 - no) {n^ - no)]~^ 



and define 

Ai = {ai^ij) = B1B2. 
Similar to Equation ([52]) . we have ||^i|| = o(l), and write 

logdet (/-^i) - logdet [(/ - ^i)"^ (/ - ^) 

it is enough to show that 

exp 



(54) 



To establish Equation 



n 



-logdet (I -^1) =1 + 0(1), 



(55) 
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and that 



Average expf--logdet {I - Ai) ^ {I - A) j=l + o(l). (56) 



The proof for Equation (|55p is as follows. Write 



^1 — 17q 







(vix(p-i))^ 0(j 



Vlx(p-l) 



, and ^l2—^o 



(p-l)x(p-l) 



'lx(p-l) 







'lx(p-l) 
(p-l)x(p-l) 



where vix(p_i) = {vj)2<j<p satisfies Vj = for 2 < j < p — r and vj = or 1 for 



p — r + 1 < j < p with ||v| 



k, and v 



ix(p-i) 



2<i<P 



satisfies a similar property. 



Without loss of generality we consider only a special case with 



1, p — r + l<j<p — r + k 
0, otherwise 



and V* 



1, p — r + k— J<j<p — r + 2k — J 
0, otherwise 



Note that Bi can be written as a polynomial of — Qq and Q2 — ^0y fmd B2 can be written 
as a polynomial of {il.2 — ^0) {^1 — ^0) ■ By a straightforward calculation it can be shown 
that 

/■ 

O (en,p) , 



and 



< 6 



i = \ and p — r+\<i<p — r + 2k — J, 
or J = 1 and p — r + l<i<p — r + 2k — or i=j = \ 
O (e^^p) , p — r+l<i<p — r + 2k — J, and p — r + l<j<p — r + 2k — J 
0, otherwise 



0, 



i = j = 1 

p — r+l<i<p — r-\-k, and p — r + k — J<j — 1 <p — r + 2k — J 



otherwise 



where T„^p = O (e^^p) , which implies 



O (/ce^ p) , i = 1 and p — r + k — J<j — l<p — r + 2k — J, or i=j = l 

O {J^n,p) 5 J = 1 ^-iid p — r + l<i<p — r + 2k — J 

O {ke'^p) , p — r + l<i<p — r + 2k — J, and p — r + k — J<j — l<p — r + 2k — J 

0, otherwise 



Note that rank(^i) < 2 due to the simple structure of {^2 — ^^o) {^1 ~ ^o)- Let A2 
ia2,ij) with 

O {kel p) ' i = 1 and j = 1 

O (/ce^^p + Jkelp) , p-r+l<i<p-r + 2k-J 

and p — r + k — J<j — l<p — r + 2k — J 
0, otherwise 
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and rank(j42) < 4 by eliminating the non-zero off-diagonal elements of the first row and 
column of Ai , and 



exp 



n 



-logdet(/-^i) 



exp 



n 



- log det (7-^2) 



We can show that all eigenvalues of are O ( J/c^e^ ^ + A;^e^ ^ + A;ei^ p) . Since ken,p 
o (1/ logp) , then 

.A;(logp)^/^ 



n 



■ logp = o (1) 



which implies 



Thus 



n 



exp 



n 



-logdet(/-Ai) =l + o(l) 



Now we establish Equation (56), which, together with Equation (55), yields Equation
(48), and thus Lemma 12 is established. Write



(/ - Ai)-' [i-A)-I = {I- Ai)-^ [{I -A) -{I- Ai)] = {I- AiY^ [Ai - A) 

{I - Ai)-^ [n^n^^ - {ni -no + ly^ (O2 - Oo + ly^ 



where 



-1 



v-1 



= [{^2 -no + i) {fii - + /) - 021^1] (Oi - + ir^ (O2 -no + i) 

= n^n^ U-Qo + 1)^1 + ^2 i-no + i) + (-O0 + if] (Qi -no + {^2 -^0 + 

It is important to observe that rank ^(/ — Ai)"^ {I — A) — < 2 again due to the simple 

structure of {Q.2 — ^o) (f^i — ^o), then — logdet (/ — Ai)~^ {I — A) is determined by at 
most two nonzero eigenvalues, which are bounded by 



{I - Ai)-\l - A) - I = {l + o{l))\\{I-no){^2-^Q){^i-^a 



(57) 



Note that 



{I - AiY^ {I - A)- I =o(l),and 

|log (1 — x)| < 2 |x| , for |x| < 1/3, 

which implies 



log det 



\i-A,r^{i-A) 



< 2 



[i-A,r\i-A)-i 
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I.e., 



n 



exp ( ■ ~ log dct 



Define 



then 



(/ - Aiy^ (/ - A) ) < exp (n {I - Air'^ {I - A) - I 
A, = {I- Oo) (^^2 - f^o) (^^1 - ^^o) , 



exp(^- • -logdet ^(/ - ^i)"^ (/ - A)J j < exp ((1 + o(l))n 
from Equations ([F7|) . It is then sufficient to show 



Average exp (2n ||^*||) = 1 + o (1) 
(7-i,A-i)ee_i(Ai,A\) 

where $\|A_*\|$ depends on the values of $\lambda_1$, $\lambda_1'$ and $(\gamma_{-1}, \lambda_{-1})$. We dropped the indices $\lambda_1$, $\lambda_1'$
and $(\gamma_{-1}, \lambda_{-1})$ from $A_*$ to simplify the notations.

Let $E_m = \{1, 2, \ldots, r\} \setminus \{1, m\}$. Let $n_{\lambda_{E_m}}$ be the number of columns of $\lambda_{E_m}$ with
column sum at least $2k - 2$, for which two rows cannot freely take value 0 or 1 in this
column. Then we have $p_{\lambda_{E_m}} = \lceil p/2 \rceil - n_{\lambda_{E_m}}$. Without loss of generality we assume that
$k \ge 3$. Since $n_{\lambda_{E_m}} \cdot (2k - 2) \le \lceil p/2 \rceil \cdot k$, the total number of 1's in the upper triangular
matrix by the construction of the parameter set, we thus have $n_{\lambda_{E_m}} \le \lceil p/2 \rceil \cdot \frac{3}{4}$, which
immediately implies $p_{\lambda_{E_m}} = \lceil p/2 \rceil - n_{\lambda_{E_m}} \ge \lceil p/2 \rceil \cdot \frac{1}{4} \ge p/8$. Thus we have

F {\\A4 > 2t ■ en,p ■ kel^p) < ¥ {\\A,\\^ > 2t ■ en,p ■ kel^^) 



< Average- 



<p 



p/8 - k 



from Equation 



which immediately implies 



Average exp(2n||y4^, 

{7_i,A_i)g0_i(Ai,A'J 

< exp ( 4n • ^^^^ • en,p ■ kel^p ] + exp (2n • 2t ■ tn,p ■ kel^p) p 



exp 



/3 

'^^-%ke^ 



k' 



p/8 - k 



dt 



n,p 



+ 



2(13-1) 



exp 



log p + t ( Anke'^ p — log 



fe2 



p/8 - k 



dt 



= 1 + 0(1), 

where the last step is an immediate consequence of the following two equations. 



nke^^p = 0(1) 
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and 

. XX , , P/8 - 1 - A: , 2(B 
il + o{l))2logp<tlog^ — ,fort> 



A;2 ' - /5 

1 Equation ([25]) and th' 
for some /3 > 1. □ 



which follow from = O (n) = O [p^^^) from Equation (|25|) and the condition p > ci??,'^ 